Turns out tin_quantum_prio isn't used anymore and is a leftover from a
previous implementation of diffserv tins. Since the variable isn't used
in any calculations it can be eliminated.
Drop variable and places where it was set. Rename remaining variable
and consolidate naming of intermediate variables that set it.
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann says:
====================
s390/qeth: features 2019-12-18
please apply the following patch series to your net-next tree.
Nothing major, just the usual mix of small improvements and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
qeth_qdio_start_poll() is called from the qdio layer's IRQ handler,
while IRQs are masked.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the old code to use struct qeth_ipa_caps, and while at it remove
all unused helper macros.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit df2a2a5225 ("s390/qeth: convert IP table spinlock to mutex")
converted the ip_lock to a mutex, we no longer have to yield it while
the subsequent IO sleep-waits for completion.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a leftover from back when a recovery action didn't go through
dev_close(), and was meant to shoot down all remaining af_iucv sockets
on the interface.
Now that the offline path always calls dev_close(), the
NETDEV_GOING_DOWN event from __dev_close_many() is sufficient and this
hack can be removed.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use inet_make_mask() to replace some complicated bit-fiddling.
Also use the right data types to replace some raw memcpy calls with
proper assignments.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate some duplicated code for adding RXIP/VIPA addresses, and
move the locking to where it's actually needed.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code that dumps the RXIP/VIPA/IPATO addresses via sysfs
first checks whether the buffer still provides sufficient space to hold
another formatted address.
But the maximum length of an formatted IPv4 address is 15 characters,
not 12. So we underestimate the max required length and if the buffer
was previously filled to _just_ the right level, a formatted address can
end up being truncated.
Revamp these code paths to use the _actually_ required length of the
formatted IP address, and while at it suppress a gratuitous newline.
Also use scnprintf() to format the output. In case of a truncation, this
would allow us to return the number of characters that were actually
written.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
card->wait_q is shared by different users, for different wake-up
conditions. qeth_irq() can potentially trigger multiple of these
conditions:
1) A change to channel->irq_pending, which qeth_send_control_data() is
waiting for.
2) A change to card->state, which qeth_clear_channel() and
qeth_halt_channel() are waiting for.
As qeth_irq() does only a single wake_up(), we might miss to wake up
a second eligible waiter. Luckily all waiters are guarded with a
timeout, so this situation should recover on its own eventually.
To make things work robustly, add an additional wake_up() for changes
to channel->state. And extract a helper that updates
channel->irq_pending along with the needed wake_up().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A qeth device that's offline should not be receiving any IRQs - all
pending IOs have been terminated, and we avoid starting any new ones.
So rather than immediately registering the IRQ handler when the device
is probed, only register it while the device is online.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu says:
====================
net: stmmac: TSN support using TAPRIO API
This series adds TSN support (EST and Frame Preemption) for stmmac driver.
1) Adds the HW specific support for EST in GMAC5+ cores.
2) Adds the HW specific support for EST in XGMAC3+ cores.
3) Integrates EST HW specific support with TAPRIO scheduler API.
4) Adds the Frame Preemption suppor on stmmac TAPRIO implementation.
5) Adds the HW specific support for Frame Preemption in GMAC5+ cores.
6) Adds the HW specific support for Frame Preemption in XGMAC3+ cores.
7) Adds support for HW debug counters for Frame Preemption available in
GMAC5+ cores.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This can be useful for debug. Add these counters on GMAC5+ cores just
like we did for XGMAC.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the HW specific support for Frame Preemption on XGMAC3+ cores.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the HW specific support for Frame Preemption on GMAC5+ cores.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the support for Frame Preemption using TAPRIO API. This works along
with EST feature and allows to select if preemptable traffic shall be
sent during specific queues opening time.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have the EST code for XGMAC and QoS we can use it with the
TAPRIO scheduler. Integrate it into the main driver and use the API to
configure the EST feature.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the support for EST in XGMAC cores. This feature allows to offload
scheduling of queues opening time to the IP.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the support for EST in GMAC5+ cores. This feature allows to offload
scheduling of queues opening time to the IP.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu says:
====================
net: stmmac: Improvements for -next
Misc improvements for stmmac.
1) Adds more information regarding HW Caps in the DebugFS file.
2) Allows interrupts to be independently enabled or disabled so that we don't
have to schedule both TX and RX NAPIs.
3) Stops using a magic number in coalesce timer re-arm.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have pending packets we re-arm the TX timer with a magic value.
This changes the re-arm of the timer from 10us to the user-defined
coalesce value. As we support different speeds, having a magic value of
10us can be either too short or to large depending on the speed so we
let user configure it. The default value of the timer is 1ms but it can
be reconfigured by ethtool.
Changes from v1:
- Reword commit message (Jakub)
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By using this mechanism we can get rid of the not so nice method of
scheduling TX NAPI when the RX was scheduled. No bandwidth reduction was
seen with this change.
Changes from v1:
- Remove useless comment (Jakub)
- Do not bind the TX clean to NAPI budget (Jakub)
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DMA Capabilites have grown but the DebugFS that shows this info has not
been updated. Lets add the missing information.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing the 'hotplug-status' node in netback_remove() is wrong; the script
may not have completed. Only remove the node once the watch has fired and
has been unregistered.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
...as the comment above the function states.
The switch to Initialising at the start of the function is somewhat bogus
as the toolstack will have set that initial state anyway. To behave
correctly, a backend should switch to InitWait once it has set up all
xenstore values that may be required by a initialising frontend. This
patch calls backend_switch_state() to make the transition at the
appropriate point.
NOTE: backend_switch_state() ignores errors from xenbus_switch_state()
and so this patch removes an error path from netback_probe(). This
means a failure to change state at this stage (in the absence of
other failures) will leave the device instantiated. This is highly
unlikley to happen as a failure to change state would indicate a
failure to write to xenstore, and that will trigger other error
paths. Also, a 'stuck' device can still be cleaned up using 'unbind'
in any case.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
...of xenbus.c
This is a cosmetic function re-ordering to reduce churn in a subsequent
patch. Some style fix-up was done to make checkpatch.pl happier.
No functional change.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shahjada Abul Husain says:
====================
cxgb4/chtls: fix issues related to high priority region
The high priority region introduced by:
commit c219399988 ("cxgb4: add support for high priority filters")
had caused regression in some code paths, leading to connection
failures for the ULDs.
This series of patches attempt to fix the regressions.
Patch 1 fixes some code paths that have been missed to consider
the high priority region.
Patch 2 fixes ULD connection failures due to wrong TID base that
had been shifted after the high priority region.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the hardware TID index is assumed to start from index 0.
However, with the following changeset,
commit c219399988 ("cxgb4: add support for high priority filters")
hardware TID index can start after the high priority region, which
has introduced a regression resulting in connection failures for
ULDs.
So, fix all related code to properly recalculate the TID start index
based on whether high priority filters are enabled or not.
Fixes: c219399988 ("cxgb4: add support for high priority filters")
Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit c219399988 ("cxgb4: add support for high priority filters")
has missed considering high priority region calculation in some code
paths. This patch fixes them.
Fixes: c219399988 ("cxgb4: add support for high priority filters")
Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 77373d49de ("net: dsa: Move the phylink driver calls into
port.c") moved and exported a bunch of symbols, but they are not used
outside of net/dsa/port.c at the moment, so no reason to export them.
Reported-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the commit referred to below we eliminated sending of the 'gap'
indicator in regular ACK messages, reserving this to explicit NACK
ditto.
Unfortunately we missed to also eliminate building of the 'gap block'
area in ACK messages. This area is meant to report gaps in the
received packet sequence following the initial gap, so that lost
packets can be retransmitted earlier and received out-of-sequence
packets can be released earlier. However, the interpretation of those
blocks is dependent on a complete and correct sequence of gaps and
acks. Hence, when the initial gap indicator is missing a single gap
block will be interpreted as an acknowledgment of all preceding
packets. This may lead to packets being released prematurely from the
sender's transmit queue, with easily predicatble consequences.
We now fix this by not building any gap block area if there is no
initial gap to report.
Fixes: commit 02288248b0 ("tipc: eliminate gap indicator from ACK messages")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajay Gupta says:
====================
net: stmmac: dwc-qos: ACPI device support
Version 3 of patches have fixes for comments from Jakub Kicinski.
These two changes are needed to enable ACPI based devices to use stmmac
driver. First patch is to use generic device api (device_*) instead of
device tree based api (of_*). Second patch avoids clock and reset accesses
for Tegra ACPI based devices. ACPI interface will be used to access clock
and reset for Tegra ACPI devices in later patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no clocks, resets or gpios referenced by Tegra ACPI
device so don't access clocks, resets or gpios interface with
ACPI device.
Clocks, resets and GPIOs for ACPI devices will be handled via
ACPI interface.
Signed-off-by: Ajay Gupta <ajayg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use generic device api so that driver can work both with DT
or ACPI based devices.
Signed-off-by: Ajay Gupta <ajayg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang says:
====================
net-next: stmmac: dwmac-mediatek: add more support for RMII
changes in v2:
PATCH 1/2 net-next: stmmac: mediatek: add more support for RMII
As Andrew's comments, add the "rmii_internal" clock to the list of clocks.
PATCH 2/2 net-next: dt-binding: dwmac-mediatek: add more description for RMII
document the "rmii_internal" clock in dt-bindings
rewrite the sample dts in dt-bindings.
v1:
This series is for support RMII when MT2712 SoC provides the reference clock.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
MT2712 SoC can provide RMII reference clock,
so add corresponding description in dt-binding.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MT2712 SoC can provide the rmii reference clock, and the clock
will output from TXC pin only, which means ref_clk pin of external
PHY should connect to TXC pin in this case.
Add corresponding clock and timing settings.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King says:
====================
improve clause 45 support in phylink
These three patches improve the clause 45 support in phylink, fixing
some corner cases that have been noticed with the addition of SFP+
NBASE-T modules, but are actually a little more wisespread than I
initially realised.
The first issue was spotted with a NBASE-T PHY on a SFP+ module plugged
into a mvneta platform. When these PHYs are not operating in USXGMII
mode, but are in a single-lane Serdes mode, they will switch between
one of several different PHY interface modes.
If we call the MAC validate() function with the current PHY interface
mode, we will restrict the supported and advertising masks to the link
modes that the current PHY interface mode supports. For example, if we
determine that we want to start the PHY with an interface mode of
2500BASE-X, then this setup will restrict the advertisement and
supported masks to 2.5G speed link modes.
What we actually want for these PHYs is to allow them to support any
link modes that the PHY supports _and_ the MAC is also capable of
supporting. Without knowing the details of the PHY interface modes that
may be used, we can do this by using PHY_INTERFACE_MODE_NA to validate
and restrict the link modes to any that the MAC supports.
mvpp2 with the 88X3310 PHY avoids this problem, because the validate()
implementation allows all MAC supported speeds not only for
PHY_INTERFACE_MODE_NA, but also for XAUI and 10GKR modes.
The first patch addresses this; current MAC drivers should continue to
work as-is, but there will be a follow-on patch to fixup at least
mvpp2.
The second issue addresses a very similar problem that occurs when
trying to use ethtool to alter the advertisement mask - we call
the MAC validate() function with the current interface mode, the
current support and requested advertisement masks. This immediately
restricts the advertisement in the same way as the above.
This patch series addresses both issues, although the patches are not
in the above order.
v2: fix patch 3 missing 1G link modes for SGMII and RGMII interface
modes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix up the mvpp2 validate implementation to adopt the same behaviour as
mvneta:
- only allow the link modes that the specified PHY interface mode
supports with the exception of 1000base-X and 2500base-X.
- use the basex helper to deal with SFP modules that can be switched
between 1000base-X vs 2500base-X.
This gives consistent behaviour between mvneta and mvpp2.
This commit depends on "net: phylink: extend clause 45 PHY validation
workaround" so is not marked for backporting to stable kernels.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e45d1f5288 ("net: phylink: support Clause 45 PHYs on SFP+
modules") added a workaround to support clause 45 PHYs which
dynamically switch their interface mode on SFP+ modules. This was
implemented by validating the PHYs supported/advertising using
PHY_INTERFACE_MODE_NA, rather than the specific interface mode that
we attached the PHY with.
However, we already have a situation where phylink is used to connect
a Marvell 88X3310 PHY which also behaves in exactly the same way, but
which seemingly doesn't need this. The reason seems to be that the
mvpp2 driver sets a whole bunch of link modes for
PHY_INTERFACE_MODE_10GKR down to 10Mb/s, despite 10GBASE-R not actually
supporting anything but 10Gb/s speeds.
When testing with drivers that (correctly) take the mvneta approach,
where the validate() method only returns what can be supported /
advertised for the specified link mode, we find that Clause 45 PHYs do
not behave as we expect: their advertisement is restricted to what
the current link will support, rather than what the PHY supports
through its dynamic switching.
Extend this workaround to all such cases; if we have a Clause 45 PHY
attaching via any means, except in USXGMII, XAUI and RXAUI which are
all unable to support this dynamic switching or have other solutions
to it, then we need to validate using PHY_INTERFACE_MODE_NA.
This should allow mvpp2 to switch to a more conformant validate()
implementation.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing ethtool with the Methode DM7052 module, it was noticed
that attempting to set the advertising mask results in the mask being
truncated to the support offered by the currently chosen PHY interface
mode.
When a PHY dynamically changes the PHY interface mode, limiting the
advertising mask in this way is not correct - if the PHY happened to
negotiate 10GBASE-T, and selected 10GBASE-R as the host interface, we
don't want to restrict the advertisement to just 10GBASE-* modes.
Rework setting the advertisement to take account of this; do not pass
the requested advertisement through phylink_validate(), but rely on
the advertisement restriction (supported mask) set when the PHY was
initially setup.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason A. Donenfeld says:
====================
WireGuard CI and housekeeping
This is a collection of commits gathered during the last 1.5 weeks since
merging WireGuard. If you'd prefer, I can send tree pull requests
instead, but I figure it might be best for now to just send things as
full patch sets to netdev.
The first part of this adds in the CI test harness that we've been using
for quite some time with success. You can type `make` and get the
selftests running in a fresh VM immediately. This has been an
instrumental tool in developing WireGuard, and I think it'd benefit most
from being in-tree alongside the selftests that are already there. Once
this lands, I plan to get build.wireguard.com building wireguard-
linux.git and net-next.git on every single commit pushed, and do so on a
bunch of different architectures. As this migrates into Linus' tree
eventually and then into net.git, I'll get net.git building there too on
every commit. Future work with this involves generalizing it to include
more networking subsystem tests beyond just WireGuard, but one step at a
time. In the process of porting this to the tree, the builder uncovered
a mistake in the config menu file, which the second commit fixes.
The last three commits are small housekeeping things, fixing spelling
mistakes, replacing call_rcu with kfree_rcu, and removing an unused
include.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove <linux/version.h> from the includes for main.c, which is unused.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[Jason: reworded commit message]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes two spelling errors in source code comments.
Signed-off-by: Josh Soref <jsoref@gmail.com>
[Jason: rewrote commit message]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the crypto selection submenu depenencies. Otherwise, we'd
wind up issuing warnings in which certain dependencies we also select
couldn't be satisfied. This condition was triggered by the addition of
the test suite autobuilder in the previous commit.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WireGuard has been using this on build.wireguard.com for the last
several years with considerable success. It allows for very quick and
iterative development cycles, and supports several platforms.
To run the test suite on your current platform in QEMU:
$ make -C tools/testing/selftests/wireguard/qemu -j$(nproc)
To run it with KASAN and such turned on:
$ DEBUG_KERNEL=yes make -C tools/testing/selftests/wireguard/qemu -j$(nproc)
To run it emulated for another platform in QEMU:
$ ARCH=arm make -C tools/testing/selftests/wireguard/qemu -j$(nproc)
At the moment, we support aarch64_be, aarch64, arm, armeb, i686, m68k,
mips64, mips64el, mips, mipsel, powerpc64le, powerpc, and x86_64.
The system supports incremental rebuilding, so it should be very fast to
change a single file and then test it out and have immediate feedback.
This requires for the right toolchain and qemu to be installed prior.
I've had success with those from musl.cc.
This is tailored for WireGuard at the moment, though later projects
might generalize it for other network testing.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In caif_xmit, there is a crash if the ptr dev is NULL. However, by
returning the error to the callers, the error can be handled. The
patch fixes this issue.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
In fore200e_send and fore200e_close, the pointers from the arguments
are dereferenced in the variable declaration block and then checked
for NULL. The patch fixes these issues by avoiding NULL pointer
dereferences.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel says:
====================
Simplify IPv4 route offload API
Motivation
==========
The aim of this patch set is to simplify the IPv4 route offload API by
making the stack a bit smarter about the notifications it is generating.
This allows driver authors to focus on programming the underlying device
instead of having to duplicate the IPv4 route insertion logic in their
driver, which is error-prone.
This is the first patch set out of a series of four. Subsequent patch
sets will simplify the IPv6 API, add offload/trap indication to routes
and add tests for all the code paths (including error paths). Available
here [1].
Details
=======
Today, whenever an IPv4 route is added or deleted a notification is sent
in the FIB notification chain and it is up to offload drivers to decide
if the route should be programmed to the hardware or not. This is not an
easy task as in hardware routes are keyed by {prefix, prefix length,
table id}, whereas the kernel can store multiple such routes that only
differ in metric / TOS / nexthop info.
This series makes sure that only routes that are actually used in the
data path are notified to offload drivers. This greatly simplifies the
work these drivers need to do, as they are now only concerned with
programming the hardware and do not need to replicate the IPv4 route
insertion logic and store multiple identical routes.
The route that is notified is the first FIB alias in the FIB node with
the given {prefix, prefix length, table ID}. In case the route is
deleted and there is another route with the same key, a replace
notification is emitted. Otherwise, a delete notification is emitted.
The above means that in the case of multiple routes with the same key,
but different TOS, only the route with the highest TOS is notified.
While the kernel can route a packet based on its TOS, this is not
supported by any hardware devices I am familiar with. Moreover, this is
not supported by IPv6 nor by BIRD/FRR from what I could see. Offload
drivers should therefore use the presence of a non-zero TOS as an
indication to trap packets matching the route and let the kernel route
them instead. mlxsw has been doing it for the past two years.
Testing
=======
To ensure there is no degradation in route insertion rates, I averaged
the insertion rate of 512k routes (/24 and /32) over 50 runs. Did not
observe any degradation.
Functional tests are available here [1]. They rely on route trap
indication, which is only added in the last patch set.
In addition, I have been running syzkaller for the past week with all
four patch sets and debug options enabled. Did not observe any problems.
Patch set overview
==================
Patches #1-#8 gradually introduce the new FIB notifications
Patch #9 converts mlxsw to use the new notifications
Patch #10 converts the remaining listeners and removes the old
notifications
v2:
* Extend fib_find_alias() with another argument instead of introducing a
new function (David Ahern)
RFC: https://patchwork.ozlabs.org/cover/1170530/
[1] https://github.com/idosch/linux/tree/fib-notifier
====================
Signed-off-by: David S. Miller <davem@davemloft.net>