linux/net
John Fastabend 29173d07f7 bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23 16:09:56 +02:00
..
6lowpan 6lowpan: Remove redundant initialisation. 2023-03-29 08:22:52 +01:00
9p Including fixes from netfilter. 2023-05-05 19:12:01 -07:00
802
8021q vlan: Add MACsec offload operations for VLAN interface 2023-04-21 08:22:14 +01:00
appletalk
atm Networking changes for 6.4. 2023-04-26 16:07:23 -07:00
ax25
batman-adv net: vlan: introduce skb_vlan_eth_hdr() 2023-04-23 14:16:44 +01:00
bluetooth Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
bpf bpf: add test_run support for netfilter program type 2023-04-21 11:34:50 -07:00
bpfilter
bridge net: add vlan_get_protocol_and_depth() helper 2023-05-10 10:25:55 +01:00
caif net: caif: Fix use-after-free in cfusbl_device_notify() 2023-03-02 22:22:07 -08:00
can Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-04-06 12:01:20 -07:00
ceph Networking changes for 6.3. 2023-02-21 18:24:12 -08:00
core bpf, sockmap: Convert schedule_work into delayed_work 2023-05-23 16:09:56 +02:00
dcb net: dcb: add helper functions to retrieve PCP and DSCP rewrite maps 2023-01-20 09:33:22 +00:00
dccp netfilter: keep conntrack reference until IPsecv6 policy checks are done 2023-03-22 21:50:23 +01:00
devlink devlink: change per-devlink netdev notifier to static one 2023-05-11 18:06:50 -07:00
dns_resolver
dsa net: dsa: tag_ocelot: call only the relevant portion of __skb_vlan_pop() on TX 2023-04-23 14:16:45 +01:00
ethernet
ethtool ethtool: Fix uninitialized number of lanes 2023-05-03 09:13:20 +01:00
handshake net/handshake: Fix section mismatch in handshake_exit 2023-04-21 20:24:57 -07:00
hsr hsr: ratelimit only when errors are printed 2023-03-16 21:11:03 -07:00
ieee802154 net: ieee802154: remove an unnecessary null pointer check 2023-03-17 09:13:53 +01:00
ife
ipv4 bpf, sockmap: Pass skb ownership through read_skb 2023-05-23 16:09:47 +02:00
ipv6 erspan: get the proto with the md version for collect_md 2023-05-13 16:58:58 +01:00
iucv net/iucv: Fix size of interrupt data 2023-03-16 17:34:40 -07:00
kcm net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
key af_key: Fix heap information leak 2023-02-13 09:30:14 +00:00
l2tp l2tp: generate correct module alias strings 2023-03-31 09:25:12 +01:00
l3mdev
lapb
llc net: deal with most data-races in sk_wait_event() 2023-05-10 10:03:32 +01:00
mac80211 wireless-next patches for v6.4 2023-04-21 07:35:51 -07:00
mac802154 mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() 2023-04-05 13:48:04 +00:00
mctp mctp: remove MODULE_LICENSE in non-modules 2023-03-09 23:06:21 -08:00
mpls net: mpls: fix stale pointer if allocation fails during device rename 2023-02-15 10:26:37 +00:00
mptcp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-04-20 16:29:51 -07:00
ncsi net/ncsi: clear Tx enable mode when handling a Config required AEN 2023-04-28 09:35:33 +01:00
netfilter netfilter: conntrack: fix possible bug_on with enable_hooks=1 2023-05-10 08:50:39 +02:00
netlabel
netlink netlink: annotate accesses to nlk->cb_running 2023-05-10 09:28:38 +01:00
netrom netrom: Fix use-after-free caused by accept on already connected socket 2023-01-30 07:30:47 +00:00
nfc nfc: change order inside nfc_se_io error path 2023-03-07 13:37:05 -08:00
nsh
openvswitch net: openvswitch: fix race on port output 2023-04-07 19:42:53 -07:00
packet net: add vlan_get_protocol_and_depth() helper 2023-05-10 10:25:55 +01:00
phonet net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
psample
qrtr net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() 2023-04-13 09:35:30 +02:00
rds rds: rds_rm_zerocopy_callback() correct order for list_add_tail() 2023-02-13 09:33:39 +00:00
rfkill net: rfkill-gpio: Add explicit include for of.h 2023-04-06 20:36:27 +02:00
rose net/rose: Fix to not accept on connected socket 2023-01-28 00:19:57 -08:00
rxrpc Including fixes from netfilter. 2023-05-05 19:12:01 -07:00
sched net/sched: flower: fix error handler on replace 2023-05-05 10:01:31 +01:00
sctp sctp: delete the nested flexible array hmac 2023-04-21 08:19:30 +01:00
smc net: deal with most data-races in sk_wait_event() 2023-05-10 10:03:32 +01:00
strparser
sunrpc NFSD 6.4 Release Notes 2023-04-29 11:04:14 -07:00
switchdev
tipc net: deal with most data-races in sk_wait_event() 2023-05-10 10:03:32 +01:00
tls net: deal with most data-races in sk_wait_event() 2023-05-10 10:03:32 +01:00
unix bpf, sockmap: Pass skb ownership through read_skb 2023-05-23 16:09:47 +02:00
vmw_vsock bpf, sockmap: Pass skb ownership through read_skb 2023-05-23 16:09:47 +02:00
wireless Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
x25 net/x25: Fix to not accept on connected socket 2023-01-25 09:51:04 +00:00
xdp bpf-next-for-netdev 2023-04-13 16:43:38 -07:00
xfrm ipsec-next-2023-04-19 2023-04-19 18:46:17 -07:00
compat.c net/compat: Update msg_control_is_user when setting a kernel pointer 2023-04-14 11:09:27 +01:00
devres.c
Kconfig net/handshake: Add Kunit tests for the handshake consumer API 2023-04-19 18:48:48 -07:00
Kconfig.debug
Makefile net/handshake: Create a NETLINK service for handling handshake requests 2023-04-19 18:48:48 -07:00
socket.c net: annotate sk->sk_err write from do_recvmmsg() 2023-05-10 09:58:29 +01:00
sysctl_net.c