Since Add/Remove Device perform the page scan updates independently
of the HCI command completion, we've introduced a potential race when
multiple mgmt commands are queued. Doing the page scan updates through
the req_workqueue ensures that the state changes are performed in a
race-free manner.
At the same time, to make the request helper more widely usable,
extend it to also cover Inquiry Scan changes since those are behind
the same HCI command. This is also reflected in the new name of the
API as well as the work struct name.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
sock_cgroup_data is a struct containing an anonymous union.
sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were
initializing a field inside the anonymous union as follows.
  struct sock_cgroup_data skcd_buf = { .val = VAL };
While this is fine on more recent compilers, gcc-4.4.7 triggers the
following errors.
include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’:
include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer
include/linux/cgroup-defs.h:619: warning: missing braces around initializer
include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’)
This is because .val belongs to the anonymous union nested inside the
struct but the initializer is missing the nesting. Fix it by adding
an extra pair of braces.
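
A minimal sketch of the two initializer forms. The type skcd_example and
its field layout are simplified stand-ins chosen for illustration, not the
real definition in cgroup-defs.h:

```c
/* Simplified stand-in for struct sock_cgroup_data; layout is illustrative. */
struct skcd_example {
	union {
		struct {
			unsigned char is_data;
			unsigned char padding;
			unsigned short prioidx;
			unsigned int classid;
		};
		unsigned long long val;
	};
};

static unsigned long long skcd_example_val(unsigned long long v)
{
	/*
	 * Form rejected by gcc-4.4.7:
	 *	struct skcd_example buf = { .val = v };
	 * The extra braces below open the anonymous union explicitly.
	 */
	struct skcd_example buf = { { .val = v } };

	return buf.val;
}
```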
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alaa Hleihel <alaa@dev.mellanox.co.il>
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: David S. Miller <davem@davemloft.net>
The cmac_ops structures are never modified, so declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
These fields are updated but never read.
Remove the overhead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under heavy TX load, bnx2x_poll() can loop forever and trigger
soft lockup bugs.
A napi poll handler must yield after one TX completion round;
the risk of livelock is too high otherwise.
Bug is very easy to trigger using a debug build, and udp flood, because
of added cpu cycles in TX completion, and we do not receive enough
packets to break the loop.
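
The shape of a poll handler that yields can be sketched roughly as below.
This is a user-space model, not the bnx2x code: struct fake_queue,
fake_poll() and TX_BATCH are hypothetical names, and the rings are
reduced to plain counters.

```c
#define TX_BATCH 32	/* hypothetical cap on one TX completion round */

struct fake_queue {
	int tx_pending;	/* TX completions waiting to be reaped */
	int rx_pending;	/* received packets waiting to be processed */
};

/* Returns the amount of work to report back to the polling core. */
static int fake_poll(struct fake_queue *q, int budget)
{
	int rx_done = 0;
	int tx_done = q->tx_pending < TX_BATCH ? q->tx_pending : TX_BATCH;

	/* Exactly one TX completion round, no matter how much is pending. */
	q->tx_pending -= tx_done;

	while (rx_done < budget && q->rx_pending > 0) {
		q->rx_pending--;
		rx_done++;
	}

	/*
	 * Work left over: report the full budget so the caller polls again
	 * later, instead of this handler spinning until the rings drain.
	 */
	if (q->tx_pending > 0 || q->rx_pending > 0)
		return budget;

	return rx_done;
}
```

With 100 pending TX completions and a budget of 64, one call reaps a
single batch and returns the full budget, leaving the rest for the next
poll.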
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch 9497df88ab ("rhashtable:
Fix reader/rehash race") added a pair of barriers. In fact the
wmb is superfluous because every subsequent write to the old or
new hash table uses rcu_assign_pointer, which itself carries a
full barrier prior to the assignment.
Therefore we may remove the explicit wmb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai says:
====================
Update Kconfig and some fixes for cxgb4
This series updates Kconfig to add a description for Chelsio's next
generation T6 family of adapters. It also fixes ethtool stats alignment,
prevents simultaneous execution of the service_ofldq thread, deals with
queue wrap-around, adds some FL counters for debugging purposes, and adds
device ID for new T5 adapters.
This patch series has been created against net-next tree and includes
patches on cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
Thanks
V2: Declare 'service_ofldq_running' as bool in Patch 4/7 ("cxgb4: prevent
simultaneous execution of service_ofldq()") based on review comment
by David Miller
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Free List DMA Mapping Errors to SGE Queue info for
Free Lists. Add Free List "Low" counter to count the number of times we
see the number of pointers that we _think_ the hardware sees in the
Free List below the Egress Threshold.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The WR headers may not fit within one descriptor.
So we need to deal with wrap-around here.
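
A generic sketch of a copy that handles wrap-around in a circular
buffer. ring_copy() and its parameters are hypothetical; the driver's
descriptor layout is not modelled, and the caller is assumed to pass
pos < size and len <= size.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes from src into a circular buffer of size bytes,
 * starting at offset pos. Returns the offset following the copy.
 */
static size_t ring_copy(unsigned char *ring, size_t size, size_t pos,
			const void *src, size_t len)
{
	size_t first = size - pos;	/* room left before the end of the ring */

	if (len <= first) {
		memcpy(ring + pos, src, len);
	} else {
		/* Split the copy: tail of the ring first, then the start. */
		memcpy(ring + pos, src, first);
		memcpy(ring, (const unsigned char *)src + first, len - first);
	}

	return (pos + len) % size;
}
```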
Based on original patch by Pranjal Joshi <pjoshi@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change mutual exclusion mechanism to prevent multiple threads of
execution from running in service_ofldq() at the same time. The old
mechanism used an implicit guard on the down-call path and none on the
restart path and wasn't working. This change makes the mechanism
explicit and is much easier to understand as a result.
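
An explicit guard of this kind can be sketched as follows. This is a
single-threaded model with hypothetical names (struct fake_ofldq,
fake_service_ofldq()); in a driver the flag would be tested and set
under the queue lock, which is omitted here.

```c
#include <stdbool.h>

struct fake_ofldq {
	bool service_running;	/* plays the role of 'service_ofldq_running' */
	int pending;		/* work items still queued */
	int sent;		/* work items handed to hardware */
};

static void fake_service_ofldq(struct fake_ofldq *q)
{
	/* Another context is already draining the queue; let it finish. */
	if (q->service_running)
		return;
	q->service_running = true;

	while (q->pending > 0) {
		q->pending--;
		q->sent++;
	}

	q->service_running = false;
}
```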
Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use helper macro ACCESS_ONCE() to load from the SGE status page
to prevent the compiler from loading it multiple times.
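
A user-space sketch of the idea. The macro has the same shape as the
kernel helper but is spelled with the GNU __typeof__ extension; struct
fake_status_page, its pidx field and entries_available() are
hypothetical and do not reflect the SGE status page layout.

```c
/* Force a single volatile load of x. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct fake_status_page {
	unsigned short pidx;	/* hypothetical index updated by hardware */
};

static int entries_available(struct fake_status_page *stat,
			     unsigned short cidx, unsigned short size)
{
	/* One load; every later use sees the same snapshot. */
	unsigned short hw_pidx = ACCESS_ONCE(stat->pidx);

	if (hw_pidx >= cidx)
		return hw_pidx - cidx;
	return size - cidx + hw_pidx;
}
```

Without the snapshot, the compiler would be free to reload stat->pidx
for the comparison and for the subtraction, and the two loads could
observe different values.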
Based on original work by Mike Werner <werner@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch raises the performance of XGE by:
1) changing the page management method for enet memory,
2) reducing the number of rmb() calls, and
3) adding memory prefetching.
Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbounded. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pulling in configuration knobs from other subsystems so
that cgroup membership tests can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in any way and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration is
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
and its accessors are defined in cgroup-defs.h. This is to prepare
for overloading the fields with a cgroup pointer.
This patch mostly performs equivalent conversions but the following
are noteworthy.
* Equality test before updating classid is removed from
sock_update_classid(). This shouldn't make any noticeable
difference and a similar test will be implemented on the helper side
later.
* sock_update_netprioidx() now takes struct sock_cgroup_data and can
be moved to netprio_cgroup.h without causing include dependency
loop. Moved.
* The dummy version of sock_update_netprioidx() converted to a static
inline function while at it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netprio builds per-netdev contiguous priomap array which is indexed by
css->id. The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range. This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.
This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 0d76d6e8b2 and merge
commit c402293bd7, reversing changes made
to c89359a42e.
The virtio-vsock device specification is not finalized yet. Michael
Tsirkin voiced concern about merging this code when the hardware
interface (and possibly the userspace interface) could still change.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov says:
====================
sh_eth: optimize MDIO code
Here's a set of 3 patches against DaveM's 'net-next.git' repo which
gets rid of ~35 LoCs in the MDIO bitbang methods.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After the MDIO bitbang code consolidation, there's no need anymore for
bb_{set|clr}() as well as bb_read() -- just expand them inline, thus
saving more LoCs...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
sh_mm[cd]_ctrl() and sh_set_mdio() all look mostly the same -- factor out
their common code and put it into sh_mdio_ctrl().
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MDIO control bits are always mapped to the same bits of the same register
(PIR), so there's no need to store their masks in the 'struct bb_info'...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xgene_mac_ops and xgene_port_ops structures are never modified, so
declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'wireless-drivers-next-for-davem-2015-12-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
brcmfmac
* support bcm4359 which can operate in two bands concurrently
* disable runtime pm for USB avoiding issues
* use generic pm callback in PCIe driver
* support wowlan wake indication reporting
* add beamforming support
* unified handling of firmware files
ath10k
* support Management Frame Protection (MFP)
* add thermal throttling support for 10.4 firmware
* add support for pktlog in QCA99X0
* add debugfs file to enable Bluetooth coexistence feature
* use firmware's native mesh interface type instead of raw mode
iwlwifi
* BT coex improvements
* D3 operation bugfixes
* rate control improvements
* firmware debugging infra improvements
* ground work for multi Rx
* various security fixes
====================
Conflicts:
drivers/net/wireless/ath/ath10k/pci.c
The conflict resolution at:
http://article.gmane.org/gmane.linux.kernel.next/37391
by Stephen Rothwell was used.
Signed-off-by: David S. Miller <davem@davemloft.net>
As the kernel generally uses negated error numbers, *err needs to be
compared with -EAGAIN (d'oh).
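
A small illustration of the convention. fake_try_recv() and
should_wait() are hypothetical helpers, not the functions touched by
this patch.

```c
#include <errno.h>
#include <stddef.h>

/* Returns a packet, or NULL with *err set to a negated errno value. */
static void *fake_try_recv(int have_data, int *err)
{
	static int packet;

	if (have_data) {
		*err = 0;
		return &packet;
	}
	*err = -EAGAIN;
	return NULL;
}

static int should_wait(int err)
{
	/* Comparing against positive EAGAIN would never match. */
	return err == -EAGAIN;
}
```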
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ea3793ee29 ("core: enable more fine-grained datagram reception control")
Signed-off-by: David S. Miller <davem@davemloft.net>
The simple_strtoul function is obsolete. This patch replaces it with
kstrtox.
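
The practical difference is strictness: simple_strtoul stops silently at
the first invalid character and does not report overflow. A user-space
analogue of a strict parser, built on strtoul(), might look like this.
parse_ulong_strict() is a hypothetical helper and only approximates the
behaviour of the kstrto*() family.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 0 and stores the value in *res, or a negated errno value. */
static int parse_ulong_strict(const char *s, int base, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* Reject empty input, leading whitespace and negative numbers. */
	if (*s == '\0' || *s == '-' || isspace((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;

	/* Tolerate one trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}
```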
Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Armstrong says:
====================
Further fix for dsa unbinding
This series fixes further issues for DSA dynamic unbinding.
The first patch completely removes the PHY link state polling.
The following two clean up the DSA state upon removal.
The last patch moves the slave destroy code into a slave function and
adds missing netdev and phy cleanup calls.
v1: http://lkml.kernel.org/r/562F8ECB.6050709@baylibre.com
v2: http://lkml.kernel.org/r/56321D9A.8010109@baylibre.com
remove phy fix and add missing calls in dsa_switch_destroy
then add dedicated dsa_slave_destroy
v3: remove polling instead of fixing it, make single patch for
dsa slave destroy
====================
Acked-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move dsa slave dedicated code from dsa_switch_destroy to a new
dsa_slave_destroy function in slave.c.
Add the netif_carrier_off and phy_disconnect calls in order to
correctly clean up the netdev state and PHY state machine.
Signed-off-by: Frode Isaksen <fisaksen@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon probe failure or unbinding, add missing dev_put() calls.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that we unassign the master_netdev dsa_ptr to make the packet
processing go through the regular Ethernet receive path.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since no DSA driver uses the polling callback any more, and since
the phylib handles the link detection, remove the link polling
work and timer code.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'mac80211-next-for-davem-2015-12-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This pull request got a bit bigger than I wanted, due to
needing to reshuffle and fix some bugs. I merged mac80211
to get the right base for some of these changes.
* new mac80211 API for upcoming driver changes: EOSP handling,
key iteration
* scan abort changes allowing to cancel an ongoing scan
* VHT IBSS 80+80 MHz support
* re-enable full AP client state tracking after fixes
* various small fixes (that weren't relevant for mac80211)
* various cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham says:
====================
net: thunderx: Miscellaneous cleanups
This patch series contains a couple of cleanup patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have moved on to using allocated pages to carve receive
buffers instead of netdev_alloc_skb() there is no need to store
any pointers for later retrieval. Earlier we had to store the
skb and skb->data pointers, which were later used to hand over the
received packet to the network stack.
This will avoid an unnecessary cache miss as well.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same switch-case repeats for nicvf_*_intr functions.
In this patch it is moved to a helper nicvf_int_type_to_mask().
By the way:
- Unneeded write to NICVF register dropped if int_type is unknown.
- netdev_dbg() is used instead of netdev_err().
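
The refactoring pattern can be sketched as below. The enum values, bit
positions and fake_* names are made up for illustration and do not match
the NICVF register layout; a pointer to a plain integer stands in for
the register.

```c
#include <stdint.h>

enum fake_int_type { FAKE_INT_CQ, FAKE_INT_SQ, FAKE_INT_RBDR, FAKE_INT_MBOX };

/* Returns the register mask for (type, q_idx), or 0 if type is unknown. */
static uint64_t fake_int_type_to_mask(int type, int q_idx)
{
	switch (type) {
	case FAKE_INT_CQ:
		return (1ULL << q_idx) << 0;
	case FAKE_INT_SQ:
		return (1ULL << q_idx) << 8;
	case FAKE_INT_RBDR:
		return (1ULL << q_idx) << 16;
	case FAKE_INT_MBOX:
		return 1ULL << 24;
	default:
		return 0;
	}
}

static void fake_enable_intr(uint64_t *reg, int type, int q_idx)
{
	uint64_t mask = fake_int_type_to_mask(type, q_idx);

	if (!mask)
		return;		/* unknown type: skip the register write */
	*reg |= mask;
}

static void fake_disable_intr(uint64_t *reg, int type, int q_idx)
{
	uint64_t mask = fake_int_type_to_mask(type, q_idx);

	if (!mask)
		return;
	*reg &= ~mask;
}
```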
Signed-off-by: Yury Norov <yury.norov@auriga.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Acked-by: Vadim Lomovtsev <Vadim.Lomovtsev@caiumnetworks.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of HW ROC, when the driver reports that the ROC expired,
it is not sufficient to purge the ROCs based on the remaining
time, as it is possible that the device finished the ROC session
before the actual requested duration.
To handle such cases, in case of ROC expired notification from
the driver, complete all the ROCs which are marked with hw_begun,
regardless of the remaining duration.
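
The completion rule can be sketched as follows. struct fake_roc and
fake_complete_hw_rocs() are hypothetical; only the hw_begun flag is
taken from the description above, and an array stands in for the ROC
list.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_roc {
	bool hw_begun;			/* the device has started this ROC */
	bool completed;
	unsigned int remaining_ms;	/* time left per the original request */
};

/* Completes every ROC the hardware has begun; returns how many. */
static int fake_complete_hw_rocs(struct fake_roc *rocs, size_t n)
{
	int done = 0;

	for (size_t i = 0; i < n; i++) {
		if (!rocs[i].hw_begun || rocs[i].completed)
			continue;
		/* remaining_ms is deliberately ignored here. */
		rocs[i].completed = true;
		done++;
	}

	return done;
}
```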
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The current unix_dgram_recvmsg code acquires the u->readlock mutex in
order to protect access to the peek offset prior to calling
__skb_recv_datagram for actually receiving data. This implies that a
blocking reader will go to sleep with this mutex held if there's
presently no data to return to userspace. Two non-desirable side effects
of this are that a later non-blocking read call on the same socket will
block on the ->readlock mutex until the earlier blocking call releases it
(or the reader is interrupted) and that later blocking read calls
will wait longer than the effective socket read timeout says they
should: The timeout will only start 'ticking' once such a reader hits
the schedule_timeout in wait_for_more_packets (core.c) while the time it
already had to wait until it could acquire the mutex is unaccounted for.
The patch avoids both by using the __skb_try_recv_datagram and
__skb_wait_for_more_packets functions created by the first patch to
implement a unix_dgram_recvmsg read loop which releases the readlock
mutex prior to going to sleep and reacquires it as needed
afterwards. Non-blocking readers will thus immediately return with
-EAGAIN if there's no data available regardless of any concurrent
blocking readers and all blocking readers will end up sleeping via
schedule_timeout, thus honouring the configured socket receive timeout.
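
The control flow of such a read loop can be modelled in user space as
below. Everything named fake_* is hypothetical: the mutex is reduced to
a boolean, the receive queue to a counter, and the sketch drops the lock
right after the receive attempt, whereas real code would hold it while
the datagram is copied out.

```c
#include <errno.h>
#include <stdbool.h>

struct fake_usock {
	bool readlock_held;
	int queued;		/* datagrams ready to be received */
	int arriving;		/* datagrams that show up during the next wait */
	int waits_with_lock;	/* sleeps taken while holding the lock */
};

/* Non-sleeping receive attempt: 0 on success, -EAGAIN if queue is empty. */
static int fake_try_recv(struct fake_usock *u)
{
	if (u->queued == 0)
		return -EAGAIN;
	u->queued--;
	return 0;
}

/* Returns nonzero when the timeout expired without new data. */
static int fake_wait_for_more(struct fake_usock *u)
{
	if (u->readlock_held)
		u->waits_with_lock++;
	if (u->arriving == 0)
		return 1;
	u->queued += u->arriving;
	u->arriving = 0;
	return 0;
}

static int fake_dgram_recv(struct fake_usock *u, bool nonblock)
{
	int err;

	for (;;) {
		u->readlock_held = true;	/* stands in for mutex_lock() */
		err = fake_try_recv(u);
		u->readlock_held = false;	/* released before any sleep */

		if (err != -EAGAIN || nonblock)
			return err;
		if (fake_wait_for_more(u))
			return -EAGAIN;		/* timed out with no data */
	}
}
```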
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __skb_recv_datagram routine in core/datagram.c provides a general
skb reception facility supposed to be utilized by protocol modules
providing datagram sockets. It encompasses both the actual recvmsg code
and a surrounding 'sleep until data is available' loop. This is
inconvenient if a protocol module has to use additional locking in order
to maintain some per-socket state the generic datagram socket code is
unaware of (as the af_unix code does). The patch below moves the recvmsg
proper code into a new __skb_try_recv_datagram routine which doesn't
sleep and renames wait_for_more_packets to
__skb_wait_for_more_packets, both routines being exported interfaces. The
original __skb_recv_datagram routine is reimplemented on top of these
two functions such that its user-visible behaviour remains unchanged.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device tree properties for a phy device are expected to be in the phy
node. The current code for the DP83867 also tries to look in the
parent node. The device's binding documentation does not mention this,
no current device tree file makes use of this, and it is not behaviour
we want. So remove looking in the parent device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a writable sysfs attribute for the "NDP to end"
quirk flag.
This makes it easier for end users to test new devices for
this firmware bug. We've been lucky so far, but we should
not depend on reporters capable of rebuilding the driver.
Cc: Enrico Mioso <mrkiko.rs@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz says:
====================
Add HA and LAG support for mlx4 SRIOV VFs
This series is built upon the code added in commit ce388ff "Merge branch
'mlx4-next'" which added HA and LAG support to mlx4 RoCE and SRIOV services.
We add HA and Link Aggregation support to single ported mlx4 Ethernet VFs.
In this case, the PF Ethernet interfaces are bonded, the VFs see single
port HW devices (already supported) -- however, now this port is highly
available. This means that all VF HW QPs (both VF Ethernet driver and VF
RoCE / RAW QPs) are subject to the V2P (Virtual-To-Physical) mapping which
is managed by the PF driver, and hence resilient across link failures and
such events.
When bonding operates in Dynamic link aggregation (802.3ad) mode, traffic
from each VF will go over the VF "base port" (the port this VF is assigned
to by the admin) as long as this port is up. When the port fails, traffic
from all VFs that are defined on this port will move to the other port, and
be back to their base-port when it recovers.
Moni and Or.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When the mlx4 driver runs in HA mode, and all VFs are single ported
ones, we make their single port Highly-Available.
This is done by taking advantage of the HA mode properties (following
bonding changes with programming the port V2P map, etc) and adding
the missing parts which are unique to SRIOV such as mirroring VF
steering rules on both ports.
Due to limits on the MAC and VLAN table this mode is enabled only when
the total number of VFs is under 64.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under HA mode, it's possible that the VF registered its GID
(and expects to get mads through the PV scheme) on a port which is
different from the one this mad arrived on, due to HA fail over.
Therefore, if the gid is not matched on the port that the packet arrived
on, check for a match on the other port if HA mode is active -- and if a
match is found on the other port, continue processing the mad using that
other port.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to HW limitations, indexes to MAC and VLAN tables are always taken
from the table of the actual port. So, if a resource holds an index to
a table, it may refer to different values during the lifetime of the
resource, unless the tables are mirrored. Also, even when the
driver is not in HA mode, the policy for allocating an index in these
tables is meant to ensure, as much as possible, that the mirroring will
succeed when the time comes. This means that in multifunction
mode the allocation of a free index in a port's table tries to make sure
that the same index in the other port's table is also free.
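
One way to picture this allocation policy is sketched below. The table
size, the 0/1 port numbering and fake_alloc_index() are hypothetical,
and the fallback to an index that is free only in the port's own table
is an assumption made for this sketch.

```c
#include <stdbool.h>

#define FAKE_TABLE_SIZE 8

/*
 * Prefer an index that is free in both ports' tables. Fall back to one
 * that is free only in this port's table. Returns -1 if the table is full.
 */
static int fake_alloc_index(bool used[2][FAKE_TABLE_SIZE], int port)
{
	int other = 1 - port;	/* ports are numbered 0 and 1 in this sketch */
	int fallback = -1;

	for (int i = 0; i < FAKE_TABLE_SIZE; i++) {
		if (used[port][i])
			continue;
		if (!used[other][i]) {
			used[port][i] = true;
			return i;	/* mirroring this index can succeed */
		}
		if (fallback < 0)
			fallback = i;
	}

	if (fallback >= 0)
		used[port][fallback] = true;
	return fallback;
}
```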
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under HA mode, steering rules set by VFs should be mirrored on both
ports of the device so packets will be accepted no matter on which
port they arrived.
Since getting into HA mode is done dynamically when the user bonds mlx4
Ethernet netdevs, we keep hold of the VF DMFS rule mbox with the port
value flipped (1->2,2->1) and execute the mirroring when getting into
HA mode. Later, when going out of HA mode, we unset the mirrored rules.
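
The port flip itself is simple arithmetic, as in this sketch. struct
fake_rule and fake_mirror_rule() are hypothetical stand-ins for the
saved rule mailbox.

```c
struct fake_rule {
	int port;		/* 1 or 2 */
	unsigned int match;	/* placeholder for the rest of the rule */
};

/* Returns a copy of the rule that targets the other port. */
static struct fake_rule fake_mirror_rule(const struct fake_rule *orig)
{
	struct fake_rule mirror = *orig;

	mirror.port = 3 - orig->port;	/* 1 -> 2, 2 -> 1 */
	return mirror;
}
```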
In that context note that mirrored rules cannot be removed explicitly.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under HA mode, the link down event should be sent to VFs only if both
ports are down.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In HA mode, the link state for VFs for which the policy is "auto"
(i.e. follow the physical link state) should be ORed from both ports.
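
For the "auto" policy this amounts to something like the sketch below,
where fake_vf_link_up(), the port_up[] array and the 0/1 port indexing
are all hypothetical.

```c
#include <stdbool.h>

/* Link state reported to a VF whose policy is "auto". */
static bool fake_vf_link_up(const bool port_up[2], int base_port, bool ha_mode)
{
	if (ha_mode)
		return port_up[0] || port_up[1];	/* up if either port is up */
	return port_up[base_port];
}
```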
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unneeded variable used to store return value.
Generated by: scripts/coccinelle/misc/returnvar.cocci
CC: Asias He <asias@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>