linux/net
Yunsheng Lin c4fef01ba4 net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for empty lockless qdisc, which aviod enqueuing
and dequeuing operation.

As qdisc->empty is not reliable to indicate a empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicate there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out of order
problem.

In order to make MISSED state reliable to indicate a empty qdisc,
we need to ensure that testing and clearing of MISSED state is
within the protection of qdisc->seqlock, only setting MISSED state
can be done without the protection of qdisc->seqlock. A MISSED
state testing is added without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.

As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:

      thread1               thread2             thread3
qdisc_run_begin() # true
                        qdisc_run_begin(q)
                             set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                            enqueue skb1
                                             qdisc empty # true
                                          qdisc_run_begin() # true
                                          sch_direct_xmit() # skb2
                         qdisc_run_begin()
                            set(MISSED)

When above happens, skb1 enqueued by thread2 is transmited after
skb2 is transmited by thread3 because MISSED state setting and
enqueuing is not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has better chance to be transmited quicker than
skb2.

This patch does not take care of the above data race, because we
view this as similar as below:
Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, because there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that.

There are below cases that need special handling:
1. When MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
   dequeue all skb in one round and call __netif_schedule(), which
   might result in a non-empty qdisc without MISSED set. In order
   to avoid this, the MISSED state is set for lockless qdisc and
   __netif_schedule() will be called at the end of qdisc_run_end.

2. The MISSED state also need to be set for lockless qdisc instead
   of calling __netif_schedule() directly when requeuing a skb for
   a similar reason.

3. For netdev queue stopped case, the MISSED case need clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calling. So a new DRAINING state
   is added to indicate this case, which also indicate a non-empty
   qdisc.

4. As there is already netif_xmit_frozen_or_stopped() checking in
   dequeue_skb() and sch_direct_xmit(), which are both within the
   protection of qdisc->seqlock, but the same checking in
   __dev_xmit_skb() is without the protection, which might cause
   empty indication of a lockless qdisc to be not reliable. So
   remove the checking in __dev_xmit_skb(), and the checking in
   the protection of qdisc->seqlock seems enough to avoid the cpu
   consumption problem for netdev queue stopped case.

1. https://lkml.org/lkml/2021/5/29/215

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
..
6lowpan 6lowpan: Fix some typos in nhc_udp.c 2021-03-24 17:52:11 -07:00
9p 9p/trans_virtio: Fix spelling mistakes 2021-06-02 14:01:55 -07:00
802
8021q net: vlan: pass thru all GSO_SOFTWARE in hw_enc_features 2021-06-18 11:58:03 -07:00
appletalk Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
atm atm: Use list_for_each_entry() to simplify code in resources.c 2021-06-10 14:08:09 -07:00
ax25 net/ax25: Delete obsolete TODO file 2021-03-30 16:54:50 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
bpf bpf: Prepare bpf syscall to be used from kernel and user space. 2021-05-19 00:33:40 +02:00
bpfilter net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
bridge bridge: cfm: remove redundant return 2021-06-22 10:35:15 -07:00
caif net: caif: modify the label out_err to out 2021-06-18 12:07:09 -07:00
can Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
ceph libceph: Fix spelling mistakes 2021-06-03 13:24:23 -07:00
core net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc 2021-06-23 12:17:35 -07:00
dcb net: dcb: Return the correct errno code 2021-06-01 17:01:33 -07:00
dccp dccp: tfrc: fix doc warnings in tfrc_equation.c 2021-06-10 14:08:49 -07:00
decnet decnet: Fix spelling mistakes 2021-06-02 14:01:55 -07:00
dns_resolver net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
dsa net: dsa: remove cross-chip support from the MRP notifiers 2021-06-21 12:50:20 -07:00
ethernet of: net: pass the dst buffer to of_get_mac_address() 2021-04-13 14:35:02 -07:00
ethtool ethtool: Validate module EEPROM offset as part of policy 2021-06-22 10:40:54 -07:00
hsr net: hsr: don't check sequence number if tag removal is offloaded 2021-06-16 12:13:01 -07:00
ieee802154 ieee802154: fix error return code in ieee802154_llsec_getparams() 2021-06-03 10:59:49 +02:00
ife net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
iucv net/af_iucv: clean up some forward declarations 2021-06-12 13:06:33 -07:00
kcm revert "net: kcm: fix memory leak in kcm_sendmsg" 2021-06-07 13:34:37 -07:00
key
l2tp l2tp: Fix spelling mistakes 2021-06-07 14:08:30 -07:00
l3mdev l3mdev: Correct function names in the kerneldoc comments 2021-03-28 17:56:55 -07:00
lapb net: lapb: Use list_for_each_entry() to simplify code in lapb_iface.c 2021-06-08 16:31:25 -07:00
llc llc2: Remove redundant assignment to rc 2021-04-27 14:16:14 -07:00
mac80211 mac80211: handle various extensible elements correctly 2021-06-18 13:25:49 +02:00
mac802154 net: mac802154: Fix general protection fault 2021-04-06 22:42:16 +02:00
mpls mpls: Remove redundant assignment to err 2021-04-27 14:17:00 -07:00
mptcp mptcp: refine mptcp_cleanup_rbuf 2021-06-22 14:36:01 -07:00
ncsi net/ncsi: Fix spelling mistakes 2021-06-07 14:08:30 -07:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
netlabel netlabel: Fix memory leak in netlbl_mgmt_add_common 2021-06-15 11:19:04 -07:00
netlink netlink: disable IRQs for netlink_lock_table() 2021-05-17 15:31:03 -07:00
netrom net: netrom: nr_in: Remove redundant assignment to ns 2021-04-28 13:59:08 -07:00
nfc Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2021-06-07 13:01:52 -07:00
nsh
openvswitch openvswitch: add trace points 2021-06-22 10:47:32 -07:00
packet Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
phonet
psample psample: Add additional metadata attributes 2021-03-14 15:00:43 -07:00
qrtr Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-18 19:47:02 -07:00
rfkill Another set of updates, all over the map: 2021-04-20 16:44:04 -07:00
rose net: rose: Fix fall-through warnings for Clang 2021-03-10 12:45:15 -08:00
rxrpc rxrpc: Fix a typo 2021-06-02 14:01:55 -07:00
sched net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc 2021-06-23 12:17:35 -07:00
sctp sctp: process sctp over udp icmp err on sctp side 2021-06-22 11:28:52 -07:00
smc net/smc: Fix ENODATA tests in smc_nl_get_fback_stats() 2021-06-21 12:16:58 -07:00
strparser
sunrpc xprtrdma: Revert 586a0787ce 2021-05-27 08:46:19 -04:00
switchdev net: bridge: propagate extack through switchdev_port_attr_set 2021-02-14 17:38:11 -08:00
tipc tipc:subscr.c: fix a spelling mistake 2021-06-10 13:48:43 -07:00
tls skbuff: add a parameter to __skb_frag_unref 2021-06-07 14:11:47 -07:00
unix __unix_find_socket_byname(): don't pass hash and type separately 2021-06-21 12:28:49 -07:00
vmw_vsock virtio/vsock: avoid NULL deref in virtio_transport_seqpacket_allow() 2021-06-22 09:49:37 -07:00
wireless cfg80211: avoid double free of PMSR request 2021-06-18 13:25:24 +02:00
x25 net: x25: Use list_for_each_entry() to simplify code in x25_route.c 2021-06-10 14:08:09 -07:00
xdp xdp: Extend xdp_redirect_map with broadcast support 2021-05-26 09:46:16 +02:00
xfrm xfrm: ipcomp: remove unnecessary get_cpu() 2021-04-19 12:49:29 +02:00
compat.c net: Return the correct errno code 2021-06-03 15:13:56 -07:00
devres.c net: devres: Correct a grammatical error 2021-06-11 12:55:28 -07:00
Kconfig bpf, kconfig: Add consolidated menu entry for bpf with core options 2021-05-11 13:56:16 -07:00
Makefile net: l3mdev: use obj-$(CONFIG_NET_L3_MASTER_DEV) form in net/Makefile 2021-01-27 17:03:52 -08:00
socket.c net: add pf_family_names[] for protocol family 2021-06-21 14:41:54 -07:00
sysctl_net.c net: Ensure net namespace isolation of sysctls 2021-04-12 13:27:11 -07:00