linux/net/ipv6
Jakub Kicinski 7f9eee196e Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux
Pavel Begunkov says:

====================
io_uring zerocopy send

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers, mixing is allowed but not recommended. Apart from usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below),
which is delivered into io_uring's Completion Queue. Those "buffer-free"
notifications are not necessarily per request, but the userspace has control
over it and should explicitly attaching a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. "+ flush" column
posts one additional "buffer-free" notification per request, and just "zc"
doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that was omitted from
the series for simplicity and as they don't change the picture drastically,
they will be sent as follow up, as well as flushing optimisations closing the
performance gap b/w two last columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
omitted optimisations will somewhat help, should look better for 4000,
but couldn't test properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches based:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces an io_uring concept of notifactors. From the userspace
  perspective it's an entity to which it can bind one or more requests and then
  requesting to flush it. Flushing a notifier makes it impossible to attach new
  requests to it, and instructs the notifier to post a completion once all
  requests attached to it are completed and the kernel doesn't need the buffers
  anymore.

  Notifications are stored in notification slots, which should be registered as
  an array in io_uring. Each slot stores only one notifier at any particular
  moment. Flushing removes it from the slot and the slot automatically replaces
  it with a new notifier. All operations with notifiers are done by specifying
  an index of a slot it's currently in.

  When registering a notification the userspace specifies a u64 tag for each
  slot, which will be copied in notification completion entries as
  cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32
  sequence number counting notifiers of a slot.

====================

Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19 14:22:41 -07:00
..
ila net: ipv6: check return value of rhashtable_init 2021-09-28 12:59:24 +01:00
netfilter netfilter: conntrack: skip verification of zero UDP checksum 2022-05-13 18:56:28 +02:00
addrconf_core.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
addrconf.c net: ipv6: new accept_untracked_na option to accept na only if in-network 2022-07-15 18:55:50 -07:00
addrlabel.c ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init 2020-11-25 11:20:16 -08:00
af_inet6.c net: Introduce a new proto_ops ->read_skb() 2022-06-20 14:05:52 +02:00
ah6.c ipv6: ah6: use swap() to make code cleaner 2021-11-18 12:00:15 +00:00
anycast.c ipv6: fix memory leaks on IPV6_ADDRFORM path 2020-07-30 16:30:55 -07:00
calipso.c cipso,calipso: resolve a number of problems with the DOI refcounts 2021-03-04 15:26:57 -08:00
datagram.c net: annotate races around sk->sk_bound_dev_if 2022-05-16 10:31:05 +01:00
esp6_offload.c net: Fix esp GSO on inter address family tunnels. 2022-03-07 13:14:04 +01:00
esp6.c xfrm: free not used XFRM_ESP_NO_TRAILER flag 2022-05-06 08:24:20 +02:00
exthdrs_core.c
exthdrs_offload.c
exthdrs.c net: ipv6: add skb drop reasons to TLV parse 2022-04-13 13:09:57 +01:00
fib6_notifier.c
fib6_rules.c ipv6: change fib6_rules_net_exit() to batch mode 2022-02-08 20:41:34 -08:00
fou6.c net: Add MODULE_DESCRIPTION entries to network modules 2020-06-20 21:33:57 -07:00
icmp.c icmp: Fix data-races around sysctl_icmp_echo_enable_probe. 2022-07-13 12:56:49 +01:00
inet6_connection_sock.c lsm,selinux: pass flowi_common instead of flowi to the LSM hooks 2020-11-23 18:36:21 -05:00
inet6_hashtables.c ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH() 2022-05-16 10:31:06 +01:00
ioam6_iptunnel.c ipv6: ioam: Insertion frequency in lwtunnel output 2022-02-04 20:24:45 -08:00
ioam6.c net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option 2022-03-03 14:38:48 +00:00
ip6_checksum.c
ip6_fib.c ipv6: annotate accesses to fn->fn_sernum 2022-01-20 20:18:37 -08:00
ip6_flowlabel.c ipv6: per-netns exclusive flowlabel checks 2022-02-16 20:37:47 -08:00
ip6_gre.c ip6_gre: use actual protocol to select xmit 2022-07-13 12:10:22 +01:00
ip6_icmp.c net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending 2021-02-23 11:29:52 -08:00
ip6_input.c ipv6: fix NULL deref in ip6_rcv_core() 2022-04-15 14:05:18 -07:00
ip6_offload.c ipv6/gro: insert temporary HBH/jumbo header 2022-05-16 10:18:56 +01:00
ip6_offload.h
ip6_output.c ipv6/udp: support externally provided ubufs 2022-07-19 14:20:54 -07:00
ip6_tunnel.c ip6_tunnel: allow to inherit from VLAN encapsulated IP 2022-07-13 12:10:22 +01:00
ip6_udp_tunnel.c net: Make locking in sock_bindtoindex optional 2020-06-01 14:57:14 -07:00
ip6_vti.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
ip6mr.c net: ip6mr: add RTM_GETROUTE netlink op 2022-07-13 13:53:48 +01:00
ipcomp6.c xfrm: remove hdr_offset indirection 2021-06-11 14:48:50 +02:00
ipv6_sockglue.c ipv6: annotate some data-races around sk->sk_prot 2022-02-18 11:53:28 +00:00
Kconfig ipv6: ioam: Add support for the ip6ip6 encapsulation 2021-10-04 12:53:35 +01:00
Makefile net: ipv6: use ipv6-y directly instead of ipv6-objs 2021-09-28 13:13:40 +01:00
mcast_snoop.c net: bridge: mcast: fix broken length + header check for MRDv6 Adv. 2021-04-27 14:02:06 -07:00
mcast.c mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter() 2022-04-30 15:19:08 +01:00
mip6.c xfrm: ipv6: move mip6_rthdr_offset into xfrm core 2021-06-11 14:48:50 +02:00
ndisc.c net: ipv6: new accept_untracked_na option to accept na only if in-network 2022-07-15 18:55:50 -07:00
netfilter.c netfilter: Use l3mdev flow key when re-routing mangled packets 2022-05-16 13:03:29 +02:00
output_core.c ipv6: use prandom_u32() for ID generation 2021-05-31 22:12:08 -07:00
ping.c net: ping6: Fix ping -6 with interface name 2022-06-01 12:44:42 +02:00
proc.c net: udp: introduce UDP_MIB_MEMERRORS for udp_mem 2020-11-09 15:34:44 -08:00
protocol.c
raw.c raw: remove unused variables from raw6_icmp_error() 2022-06-22 18:48:08 -07:00
reassembly.c net: ipv6: Handle delivery_time in ipv6 defrag 2022-03-03 14:38:48 +00:00
route.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
rpl_iptunnel.c net: ipv6: rpl_iptunnel: simplify the return expression of rpl_do_srh() 2020-12-08 16:22:54 -08:00
rpl.c net: ipv6: rpl*: Fix strange kerneldoc warnings due to bad header 2020-10-30 12:12:52 -07:00
seg6_hmac.c net: ipv6: unexport __init-annotated seg6_hmac_net_init() 2022-06-28 21:23:30 -07:00
seg6_iptunnel.c seg6: fix skb checksum evaluation in SRH encapsulation/insertion 2022-07-14 10:15:15 +02:00
seg6_local.c seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors 2022-07-14 10:15:15 +02:00
seg6.c icmp: ICMPV6: Examine invoking packet for Segment Route Headers. 2022-01-04 12:17:35 +00:00
sit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-30 16:31:00 -07:00
syncookies.c tcp: make sure treq->af_specific is initialized 2022-04-25 12:10:11 +01:00
sysctl_net_ipv6.c net: sysctl: introduce sysctl SYSCTL_THREE 2022-05-03 10:15:06 +02:00
tcp_ipv6.c net: Find dst with sk's xfrm policy not ctl_sk 2022-07-11 13:39:56 +01:00
tcpv6_offload.c net: move gro definitions to include/net/gro.h 2021-11-16 13:16:54 +00:00
tunnel6.c tunnel6: add tunnel6_input_afinfo for ipip and ipv6 tunnels 2020-07-09 12:52:37 +02:00
udp_impl.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_offload.c gro: remove rcu_read_lock/rcu_read_unlock from gro_receive handlers 2021-11-24 17:21:42 -08:00
udp.c net: add per_cpu_fw_alloc field to struct proto 2022-06-10 16:21:26 -07:00
udplite.c net: add per_cpu_fw_alloc field to struct proto 2022-06-10 16:21:26 -07:00
xfrm6_input.c xfrm: state: remove extract_input indirection from xfrm_state_afinfo 2020-05-06 09:40:08 +02:00
xfrm6_output.c xfrm: fix tunnel model fragmentation behavior 2022-03-01 12:08:40 +01:00
xfrm6_policy.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
xfrm6_protocol.c
xfrm6_state.c xfrm: remove output_finish indirection from xfrm_state_afinfo 2020-05-06 09:40:08 +02:00
xfrm6_tunnel.c xfrm: remove description from xfrm_type struct 2021-06-09 09:38:52 +02:00