linux/net
Jakub Kicinski 7f9eee196e Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux
Pavel Begunkov says:

====================
io_uring zerocopy send

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers, mixing is allowed but not recommended. Apart from usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below),
which is delivered into io_uring's Completion Queue. Those "buffer-free"
notifications are not necessarily per request, but the userspace has control
over it and should explicitly attaching a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. "+ flush" column
posts one additional "buffer-free" notification per request, and just "zc"
doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that was omitted from
the series for simplicity and as they don't change the picture drastically,
they will be sent as follow up, as well as flushing optimisations closing the
performance gap b/w two last columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
omitted optimisations will somewhat help, should look better for 4000,
but couldn't test properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches based:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces an io_uring concept of notifactors. From the userspace
  perspective it's an entity to which it can bind one or more requests and then
  requesting to flush it. Flushing a notifier makes it impossible to attach new
  requests to it, and instructs the notifier to post a completion once all
  requests attached to it are completed and the kernel doesn't need the buffers
  anymore.

  Notifications are stored in notification slots, which should be registered as
  an array in io_uring. Each slot stores only one notifier at any particular
  moment. Flushing removes it from the slot and the slot automatically replaces
  it with a new notifier. All operations with notifiers are done by specifying
  an index of a slot it's currently in.

  When registering a notification the userspace specifies a u64 tag for each
  slot, which will be copied in notification completion entries as
  cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32
  sequence number counting notifiers of a slot.

====================

Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19 14:22:41 -07:00
..
6lowpan net: 6lowpan: constify lowpan_nhc structures 2022-06-09 21:53:28 +02:00
9p xen: switch gnttab_end_foreign_access() to take a struct page pointer 2022-05-27 11:05:29 +02:00
802
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
appletalk net: remove noblock parameter from skb_recv_datagram() 2022-04-06 13:45:26 +01:00
atm net: SO_RCVMARK socket option for SO_MARK with recvmsg() 2022-04-28 13:08:15 -07:00
ax25 ax25: use GFP_KERNEL in ax25_dev_device_up() 2022-06-17 20:34:13 -07:00
batman-adv net: wrap the wireless pointers in struct net_device in an ifdef 2022-05-22 21:51:54 +01:00
bluetooth Bluetooth: core: Fix deadlock on hci_power_on_sync. 2022-07-05 13:20:03 -07:00
bpf bpf, test_run: Remove unnecessary prog type checks 2022-06-03 14:53:33 -07:00
bpfilter uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-30 16:31:00 -07:00
caif net: remove noblock parameter from skb_recv_datagram() 2022-04-06 13:45:26 +01:00
can Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-07 12:07:37 -07:00
ceph libceph: use swap() macro instead of taking tmp variable 2022-05-25 20:45:13 +02:00
core Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
dcb net: dcb: disable softirqs in dcbnl_flush_dev() 2022-03-03 08:01:55 -08:00
dccp Revert "net: Add a second bind table hashed by port and address" 2022-06-16 11:07:59 -07:00
decnet net, neigh: introduce interval_probe_time_ms for periodic probe 2022-06-30 13:14:35 +02:00
dns_resolver
dsa net: dsa: tag_ksz: add tag handling for Microchip LAN937x 2022-07-02 16:34:05 +01:00
ethernet net: ethernet: set default assignment identifier to NET_NAME_ENUM 2022-04-07 21:04:03 -07:00
ethtool Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-23 12:33:24 -07:00
hsr net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
ieee802154 net: SO_RCVMARK socket option for SO_MARK with recvmsg() 2022-04-28 13:08:15 -07:00
ife
ipv4 Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
ipv6 Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
iucv net: keep sk->sk_forward_alloc as small as possible 2022-06-10 16:21:27 -07:00
kcm net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2022-06-01 17:44:04 -07:00
l2tp l2tp: l2tp_debugfs: fix Clang -Wformat warnings 2022-07-08 12:14:36 +01:00
l3mdev l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu 2022-04-15 14:27:24 -07:00
lapb
llc net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
mac802154 net: mac802154: Fix symbol durations 2022-04-30 20:29:47 +02:00
mctp Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
mpls net: mpls: fix memdup.cocci warning 2022-04-07 21:06:41 -07:00
mptcp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
ncsi net/ncsi: use proper "mellanox" DT vendor prefix 2022-06-23 20:51:06 -07:00
netfilter netfilter: nf_tables: replace BUG_ON by element length check 2022-07-09 16:25:09 +02:00
netlabel netlabel: fix out-of-bounds memory accesses 2022-03-21 10:59:11 +00:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-05-12 16:15:30 -07:00
netrom net: remove noblock parameter from skb_recv_datagram() 2022-04-06 13:45:26 +01:00
nfc net: nfc: Directly use ida_alloc()/free() 2022-05-28 15:28:47 +01:00
nsh
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-23 12:33:24 -07:00
packet net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
phonet net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
psample
qrtr net: remove noblock parameter from skb_recv_datagram() 2022-04-06 13:45:26 +01:00
rds Linux 5.18 2022-05-24 12:40:28 -03:00
rfkill rfkill: make new event layout opt-in 2022-03-18 13:09:17 +02:00
rose net: rose: fix UAF bug caused by rose_t0timer_expiry 2022-07-06 19:49:11 -07:00
rxrpc net: rxrpc: fix clang -Wformat warning 2022-07-08 20:15:11 -07:00
sched net/sched: sch_cbq: Delete unused delay_timer 2022-07-15 11:29:09 +01:00
sctp net: keep sk->sk_forward_alloc as small as possible 2022-06-10 16:21:27 -07:00
smc net/smc: Extend SMC-R link group netlink attribute 2022-07-18 11:19:17 +01:00
strparser strparser: pad sk_skb_cb to avoid straddling cachelines 2022-07-08 18:38:44 -07:00
sunrpc Notable regression fixes: 2022-07-02 11:20:56 -07:00
switchdev net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-30 16:31:00 -07:00
tls tls: rx: decrypt into a fresh skb 2022-07-18 11:24:11 +01:00
unix Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-07-09 12:24:16 -07:00
vmw_vsock hyperv-next for 5.19 2022-05-28 11:39:01 -07:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
x25 x25: remove redundant pointer dev 2022-05-10 11:59:22 +02:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-07 12:07:37 -07:00
xfrm net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
compat.c skbuff: carry external ubuf_info in msghdr 2022-07-19 14:20:42 -07:00
devres.c
Kconfig page_pool: Add allocation stats 2022-03-03 09:55:28 +00:00
Kconfig.debug net: CONFIG_DEBUG_NET depends on CONFIG_NET 2022-06-02 10:15:05 -07:00
Makefile
socket.c Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
sysctl_net.c