The hyper-v transparent bonding should have used master_dev_link.
The netvsc device should look like a master bond device not
like the upper side of a tunnel.
This makes the semantics the same so that userspace applications
looking at network devices see the correct master relationshipship.
Fixes: 0c195567a8 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer the direct use of octal for permissions.
Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.
Miscellanea:
o Whitespace neatening around these conversions.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As defined in hyperv_net.h, the NVSP_STAT_SUCCESS is one not zero.
Some functions returns 0 when it actually means NVSP_STAT_SUCCESS.
This patch fixes them.
In netvsc_receive(), it puts the last RNDIS packet's receive status
for all packets in a vmxferpage which may contain multiple RNDIS
packets.
This patch puts NVSP_STAT_FAIL in the receive completion if one of
the packets in a vmxferpage fails.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make common function for detaching internals of device
during changes to MTU and RSS. Make sure no more packets
are transmitted and all packets have been received before
doing device teardown.
Change the wait logic to be common and use usleep_range().
Changes transmit enabling logic so that transmit queues are disabled
during the period when lower device is being changed. And enabled
only after sub channels are setup. This avoids issue where it could
be that a packet was being sent while subchannel was not initialized.
Fixes: 8195b1396e ("hv_netvsc: fix deadlock on hotplug")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_uc/mc_sync calls need to have the device address list
locked. This was spotted by running with lockdep enabled.
Fixes: bee9d41b37 ("hv_netvsc: propagate rx filters to VF")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rx_mode operation handler is different than other callbacks
in that is not always called with rtnl held. Therefore use
RCU to ensure that references are valid.
Fixes: bee9d41b37 ("hv_netvsc: propagate rx filters to VF")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netvsc device should propagate filters to the SR-IOV VF
device (if present). The flags also need to be propagated to the
VF device as well. This only really matters on local Hyper-V
since Azure does not support multiple addresses.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VF is used for accelerated networking it will likely have
more queues (and different policy) than the synthetic NIC.
This patch defers the queue policy to the VF so that all the
queues can be used. This impacts workloads like local generate UDP.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't wake transmit queues if link is not up yet.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the transmit queue is known full, then don't keep aggregating
data. And the cp_partial flag which indicates that the current
aggregation buffer is full can be folded in to avoid more
conditionals.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netvsc_receive_callback function was using RCU to find the
appropriate underlying netvsc_device. Since calling function already
had that pointer, this was unnecessary.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller (netvsc_receive) already has the net device pointer,
and should just pass that to functions rather than the hyperv device.
This eliminates several impossible error paths in the process.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb can not be allocated, update ethtool statisitics
rather than rx_dropped which is intended for netif_receive.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The values were not computed correctly. There are no significant
visible impact, though.
The intended size of RX buffer is 16 MB, and the default slot size is 1728.
So, NETVSC_DEFAULT_RX should be 16*1024*1024 / 1728 = 9709.
The intended size of TX buffer is 1 MB, and the slot size is 6144.
So, NETVSC_DEFAULT_TX should be 1024*1024 / 6144 = 170.
The patch puts the formula directly into the macro, and moves them to
hyperv_net.h, together with related macros.
Fixes: 5023a6db73 ("netvsc: increase default receive buffer size")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memset of the whole maximum possible RNDIS header is unnecessary.
For the main part of the header use a structure assignment.
No need to memset the whole per packet info. Instead rely on caller to
set what it wants. Also get rid of cast to void and signed/unsigned
conversion. Now return pointer to per packet data (rather than the
header) which simplifies use by code setting up the packet data.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every packet sent checks the available ring space. The calculation
can be sped up by using reciprocal divide which is multiplication.
Since ring_size can only be configured by module parameter, so it doesn't
have to be passed around everywhere. Also it should be unsigned
since it is number of pages.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rndis_filter_device_add() is called both from netvsc_probe() when we
initially create the device and from set channels/mtu/ringparam
routines where we basically remove the device and add it back.
hw_features is reset in rndis_filter_device_add() and filled with
host data. However, we lose all additional flags which are set outside
of the driver, e.g. register_netdevice() adds NETIF_F_SOFT_FEATURES and
many others.
Unfortunately, calls to rndis_{query_hwcaps(), _set_offload_params()}
calls cannot be avoided on every RNDIS reset: host expects us to set
required features explicitly. Moreover, in theory hardware capabilities
can change and we need to reflect the change in hw_features.
Reset net->hw_features bits according to host data in
rndis_netdev_set_hwcaps(), clear corresponding feature bits
from net->features in case some features went missing (will never happen
in real life I guess but let's be consistent).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename this variable because it is the Receive indirection
table.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch supports the options to switch TCP hash level between
L3 and L4 by ethtool command. TCP over IPv4 and v6 can be set
differently. The default hash level is L4. We currently only
allow switching TX hash level from within the guests.
For example, for TCP over IPv4 on eth0:
To include TCP port numbers in hashing:
ethtool -N eth0 rx-flow-hash tcp4 sdfn
To exclude TCP port numbers in hashing:
ethtool -N eth0 rx-flow-hash tcp4 sd
To show TCP hash level:
ethtool -n eth0 rx-flow-hash tcp4
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This simplifies the logic and make it easier to add more
options.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the numbers of events for stop_queue and wake_queue in
ethtool stats.
Example:
ethtool -S eth0
NIC statistics:
...
stop_queue: 7
wake_queue: 7
...
Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For older hosts without multi-channel (vRSS) support, and some error
cases, we still need to set the real number of queues to one.
This patch adds this missing setting.
Fixes: 8195b1396e ("hv_netvsc: fix deadlock on hotplug")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If MTU is changed the host would reject the send buffer change.
This problem is result of recent change to allow changing send
buffer size.
Every time we change the MTU, we store the previous net_device section
count before destroying the buffer, but we don’t store the previous
section size. When we reinitialize the buffer, its size is calculated
by multiplying the previous count and previous size. Since we
continuously increase the MTU, the host returns us a decreasing count
value while the section size is reinitialized to 1728 bytes every
time.
This eventually leads to a condition where the calculated buf_size is
so small that the host rejects it.
Fixes: 8b5327975a ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default receive buffer size was reduced by recent change
to a value which was appropriate for 10G and Windows Server 2016.
But the value is too small for full performance with 40G on Azure.
Increase the default back to maximum supported by host.
Fixes: 8b5327975a ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a virtual device is added dynamically (via host console), then
the vmbus sends an offer message for the primary channel. The processing
of this message for networking causes the network device to then
initialize the sub channels.
The problem is that setting up the sub channels needs to wait until
the subsequent subchannel offers have been processed. These offers
come in on the same ring buffer and work queue as where the primary
offer is being processed; leading to a deadlock.
This did not happen in older kernels, because the sub channel waiting
logic was broken (it wasn't really waiting).
The solution is to do the sub channel setup in its own work queue
context that is scheduled by the primary channel setup; and then
happens later.
Fixes: 732e49850c ("netvsc: fix race on sub channel creation")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The limit of setting receive indirection table value should be
the current number of channels, not the VRSS_CHANNEL_MAX.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of the following code, net->num_tx_queues equals to
VRSS_CHANNEL_MAX, and max_chn is less than or equals to VRSS_CHANNEL_MAX.
netvsc_drv.c:
alloc_etherdev_mq(sizeof(struct net_device_context),
VRSS_CHANNEL_MAX);
rndis_filter.c:
net_device->max_chn = min_t(u32, VRSS_CHANNEL_MAX, num_possible_rss_qs);
So this patch removes the unnecessary limit check before comparing
with "max_chn".
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the parameter, num_queue in
rndis_filter_set_rss_param(), which is no longer in use.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If VF is attached then can still allow netvsc driver module to
be removed. Just have to make sure and do the cleanup.
Also, avoid extra rtnl round trip when calling unregister.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one routine for datapath up/down. Don't need to reopen
the rndis layer.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a deadlock possible when canceling the link status
delayed work queue. The removal process is run with RTNL held,
and the link status callback is acquring RTNL.
Resolve the issue by using trylock and rescheduling.
If cancel is in process, that block it from happening.
Fixes: 122a5f6410 ("staging: hv: use delayed_work for netvsc_send_garp()")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now remove rndis filter before unregister_netdev(), which calls
device close. It involves closing rndis filter already removed.
This patch fixes this error.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch add the functions to switch UDP hash level between
L3 and L4 by ethtool command. UDP over IPv4 and v6 can be set
differently. The default hash level is L4. We currently only
allow switching TX hash level from within the guests.
On Azure, fragmented UDP packets have high loss rate with L4
hashing. Using L3 hashing is recommended in this case.
For example, for UDP over IPv4 on eth0:
To include UDP port numbers in hasing:
ethtool -N eth0 rx-flow-hash udp4 sdfn
To exclude UDP port numbers in hasing:
ethtool -N eth0 rx-flow-hash udp4 sd
To show UDP hash level:
ethtool -n eth0 rx-flow-hash udp4
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ethtool statistics for case where send chimmeny buffer is
exhausted and driver has to fall back to doing scatter/gather
send. Also, add statistic for case where ring buffer is full and
receive completions are delayed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but host API breaks
receive and send buffer into chunks. The final size of the chunks are
controlled by the host.
The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assignment to a typed pointer is sufficient in C.
No cast is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If setting new values fails, and the attempt to restore original
settings fails. Then log an error and leave device down.
This should never happen, but if it does don't go down in flames.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.
If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.
It also fixes the error unwind in the original code.
If give a bad address, the old code would change the device
MAC address anyway.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VF device is discovered, delay bring it automatically up in
order to allow userspace to some simple changes (like renaming).
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Go back to switching datapath directly in the notifier callback.
Otherwise datapath might not get switched on unregister.
No need for calling the NOTIFY_PEERS notifier since that is only for
a gratitious ARP/ND packet; but that is not required with Hyper-V
because both VF and synthetic NIC have the same MAC address.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 0c195567a8 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With new transparent VF support, it is possible to get a deadlock
when some of the deferred work is running and the unregister_vf
is trying to cancel the work element. The solution is to use
trylock and reschedule (similar to bonding and team device).
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 0c195567a8 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements transparent fail over from synthetic NIC to
SR-IOV virtual function NIC in Hyper-V environment. It is a better
alternative to using bonding as is done now. Instead, the receive and
transmit fail over is done internally inside the driver.
Using bonding driver has lots of issues because it depends on the
script being run early enough in the boot process and with sufficient
information to make the association. This patch moves all that
functionality into the kernel.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>