linux/drivers/net
Jussi Maki 9e2ee5c7e7 net, bonding: Add XDP support to the bonding driver
XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.

Semantics of XDP_TX when attached to a bond are equivalent to the case
where a tc/BPF program is attached to the bond: the packet is
transmitted out of the bond itself, using one of the bond's configured
xmit methods to select a slave device (rather than XDP_TX on the slave
itself). Handling of XDP_TX to transmit using the configured bonding
mechanism is therefore implemented by rewriting the BPF program return
value in bpf_prog_run_xdp. To avoid performance impact, this check is
guarded by a static key, which is incremented when an XDP program is
loaded onto a bond device. This approach was chosen to avoid changes to
drivers implementing XDP. If the selected slave device does not match
the receive device, then XDP_REDIRECT is transparently used to perform
the redirection so that the network driver releases the packet from its
RX ring. The bonding driver hashing functions have been refactored to
allow reuse with xdp_buffs and avoid code duplication.
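
The return value rewriting described above can be modelled in plain
userspace C. This is only a sketch under stated assumptions, not the
kernel code: every model_* name is hypothetical, a plain bool stands in
for the static key, and a simple modulo over a flow hash stands in for
the bond's configured xmit method.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Action codes in the same order as enum xdp_action in the UAPI header. */
enum model_xdp_action {
	MODEL_XDP_ABORTED = 0,
	MODEL_XDP_DROP,
	MODEL_XDP_PASS,
	MODEL_XDP_TX,
	MODEL_XDP_REDIRECT,
};

#define MODEL_MAX_SLAVES 8

/* Hypothetical, minimal view of a bond: just the slave ifindexes. */
struct model_bond {
	int slave_ifindex[MODEL_MAX_SLAVES];
	int n_slaves;
};

/* Stand-in for the static key: true while some bond has an XDP program. */
static bool model_bond_xdp_enabled;

struct model_verdict {
	enum model_xdp_action action;
	int tx_ifindex; /* only meaningful for TX and REDIRECT */
};

/* Assumption: slave selection is a modulo over a precomputed flow hash. */
static int model_select_slave(const struct model_bond *bond, uint32_t flow_hash)
{
	assert(bond->n_slaves > 0 && bond->n_slaves <= MODEL_MAX_SLAVES);
	return bond->slave_ifindex[flow_hash % (uint32_t)bond->n_slaves];
}

/*
 * Rewrite the program's verdict: XDP_TX on a bonded receive device becomes
 * either XDP_TX (selected slave is the receive device) or XDP_REDIRECT
 * (selected slave is a different device). All other verdicts pass through.
 */
static struct model_verdict model_fixup_verdict(enum model_xdp_action act,
						const struct model_bond *bond,
						int rx_ifindex,
						uint32_t flow_hash)
{
	struct model_verdict v = { .action = act, .tx_ifindex = rx_ifindex };

	/* Fast path: nothing to do unless the "static key" is enabled. */
	if (!model_bond_xdp_enabled || bond == NULL || act != MODEL_XDP_TX)
		return v;

	v.tx_ifindex = model_select_slave(bond, flow_hash);
	if (v.tx_ifindex != rx_ifindex)
		v.action = MODEL_XDP_REDIRECT;
	return v;
}
```

With two slaves, a packet whose selected slave equals the receive device
keeps XDP_TX, while any other packet is turned into XDP_REDIRECT towards
the selected slave. With the flag cleared the verdict is returned
unchanged, which mirrors why drivers need no changes.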

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP, and also to transparently support bond devices for projects that
use XDP, given that most modern NICs are dual-port adapters. An
alternative to this approach would be to implement 802.3ad in
user-space and implement the bonding load-balancing in the XDP program
itself, but this is rather cumbersome in terms of slave device
management (e.g. by watching netlink) and requires the orchestrator to
maintain separate programs for the native and bond cases. A native
in-kernel implementation overcomes these issues and provides more
flexibility.

Below are benchmark results from two machines connected with 100Gbit
Intel E810 (ice) NICs, with a 32-core 3970X on the sending machine and
a 16-core 3950X on the receiving machine. 64 byte packets were sent
with pktgen-dpdk at full rate. Two issues [2, 3] were identified with
the ice driver, so the tests were performed with iommu=off and patch
[2] applied. Additionally, the bonding round robin algorithm was
modified to use per-cpu tx counters, as the shared rr_tx_counter caused
high CPU load (50% vs 10%) and a high rate of cache misses (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].
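
The idea behind the per-cpu tx counters can be illustrated with a small
userspace model. This is a sketch under assumptions, not the code from
patch 2/3: the model_* names, the fixed CPU count and the 64 byte cache
line size are all hypothetical. Each CPU increments only its own
counter, so no cache line is shared between CPUs on the transmit path.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS   4  /* assumption: fixed CPU count for the model */
#define MODEL_CACHELINE 64 /* assumption: 64 byte cache lines */

/* One counter per CPU, each aligned to its own cache line. */
struct model_percpu_counter {
	_Alignas(MODEL_CACHELINE) uint32_t value;
};

struct model_rr_bond {
	struct model_percpu_counter rr_tx_counter[MODEL_NR_CPUS];
	int n_slaves;
};

/*
 * Pick the next slave index for a packet sent from the given CPU.
 * Only that CPU's counter is touched, so CPUs never contend on a
 * shared counter; each CPU round-robins over the slaves independently.
 */
static int model_rr_next_slave(struct model_rr_bond *bond, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	assert(bond->n_slaves > 0);

	uint32_t id = bond->rr_tx_counter[cpu].value++;
	return (int)(id % (uint32_t)bond->n_slaves);
}
```

The trade-off in this model is that ordering is round robin per CPU
rather than globally, in exchange for avoiding cache line bouncing
between CPUs.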

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%     116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 --------------------------------------------------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):       10.33%     117.2Mpps
   XDP_TX (RSS):         10.64%      25.1Mpps     24.0Mpps
 --------------------------------------------------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%     117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs so that they were spread across multiple RX queues.

  [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
  [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
  [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
  [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
2021-08-09 23:20:14 +02:00
appletalk appletalk: use ndo_siocdevprivate 2021-07-27 20:11:43 +01:00
arcnet
bonding net, bonding: Add XDP support to the bonding driver 2021-08-09 23:20:14 +02:00
caif Networking fixes for 5.14-rc2, including fixes from bpf and netfilter. 2021-07-14 09:24:32 -07:00
can Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-07-31 09:14:46 -07:00
dsa net: dsa: mt7530: drop paranoid checks in .get_tag_protocol() 2021-08-02 15:06:55 +01:00
ethernet net/mlx4: make the array states static const, makes object smaller 2021-08-02 15:02:13 -07:00
fddi fddi: use ndo_siocdevprivate 2021-07-27 20:11:43 +01:00
fjes Tracing updates for 5.14: 2021-07-03 11:13:22 -07:00
hamradio hamradio: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
hippi hippi: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
hyperv Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
ieee802154 ieee802154: hwsim: avoid possible crash in hwsim_del_edge_nl() 2021-06-22 21:26:59 +02:00
ipa net: ipa: don't suspend endpoints if setup not complete 2021-07-28 00:06:27 +01:00
ipvlan ipvlan: Add handling of NETDEV_UP events 2021-07-29 22:17:37 +01:00
mctp mctp: Add initial driver infrastructure 2021-07-29 15:06:50 +01:00
mdio net: mdiobus: withdraw fwnode_mdbiobus_register 2021-06-25 11:46:29 -07:00
mhi net: mhi: Improve MBIM packet counting 2021-07-26 12:21:00 +01:00
netdevsim netdevsim: make array res_ids static const, makes object smaller 2021-08-02 09:12:24 -07:00
pcs net: pcs: xpcs: Fix a less than zero u16 comparison error 2021-06-17 11:14:06 -07:00
phy net: phy: mscc: make some arrays static const, makes object smaller 2021-08-02 09:15:07 -07:00
plip slip/plip: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
ppp ppp: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
slip slip/plip: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
team
usb dev_ioctl: split out ndo_eth_ioctl 2021-07-27 20:11:45 +01:00
vmxnet3 vmxnet3: update to version 6 2021-07-16 17:32:14 -07:00
wan net: split out ndo_siowandev ioctl 2021-07-27 20:11:45 +01:00
wireguard
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-07-31 09:14:46 -07:00
wwan wwan: core: Fix missing RTM_NEWLINK event for default link 2021-07-23 17:16:19 +01:00
xen-netback
bareudp.c bareudp: allow redirecting bareudp packets to eth devices 2021-06-28 12:44:17 -07:00
dummy.c
eql.c eql: use ndo_siocdevprivate 2021-07-27 20:11:43 +01:00
geneve.c
gtp.c gtp: reset mac_header after decap 2021-06-28 12:44:17 -07:00
ifb.c
Kconfig mctp: Add initial driver infrastructure 2021-07-29 15:06:50 +01:00
LICENSE.SRC
loopback.c
macsec.c net: macsec: fix the length used to copy the key for offloading 2021-06-24 12:41:12 -07:00
macvlan.c dev_ioctl: split out ndo_eth_ioctl 2021-07-27 20:11:45 +01:00
macvtap.c
Makefile mctp: Add initial driver infrastructure 2021-07-29 15:06:50 +01:00
mdio.c
mii.c
net_failover.c
netconsole.c
nlmon.c
ntb_netdev.c
rionet.c
sb1000.c sb1000: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
Space.c
sungem_phy.c
tap.c
thunderbolt.c
tun.c
veth.c veth: use skb_prepare_for_gro() 2021-07-29 12:18:12 +01:00
virtio_net.c Networking fixes for 5.14-rc2, including fixes from bpf and netfilter. 2021-07-14 09:24:32 -07:00
vrf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-29 15:45:27 -07:00
vsockmon.c
vxlan.c vxlan: add missing rcu_read_lock() in neigh_reduce() 2021-06-22 09:48:38 -07:00
xen-netfront.c