linux/Documentation/networking
Jakub Kicinski 95d1815f09 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
   from Li Qiong.

7) A large batch of IPVS updates to replace timer-based estimators by
   kthreads to scale up wrt. CPUs and workload (millions of estimators).

Julian Anastasov says:

	This patchset implements stats estimation in kthread context.
It replaces the code that runs on a single CPU in timer context every 2
seconds and causes latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

	Spread the estimation on multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules.  When stats are not needed, it is recommended to use
run_estimation=0 as already implemented before this change.

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest, which
hold estimator structures, are now always freed from an RCU
callback. This ensures an RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to an array. For now we limit
kthreads depending on the number of CPUs.

- even while there can be a kthread structure, its task
may not be running, e.g. before the first service is added,
while the sysctl var is set to an empty cpulist, or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks, which saves memory for
setups with a small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds.

- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list) and later kthread 0
distributes them to other kthreads. The optimization
is based on the fact that a newly added estimator
should be estimated after 2 seconds, so we have the
time to offload the adding to chain from the controlling
process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, a common case when all
services and dests are initially configured, we may
spread the estimators to more chains and, as a result,
reduce the initial delay below 2 seconds.
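
The row placement and the per-chain limit described in the list above can
be pictured with a small user-space model. This is only a sketch under
stated assumptions, not the kernel code: the constants come from the text
(50 tick rows, a 2-second period, a sub-100 microsecond budget), while the
helper names row_before(), ticks_until(), first_estimation_delay_ms() and
chain_max_for() are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

#define EST_NTICKS     50    /* tick rows per kthread context (from the text) */
#define EST_PERIOD_MS  2000  /* each estimator is processed every 2 seconds */
#define EST_TICK_MS    (EST_PERIOD_MS / EST_NTICKS)
#define EST_BUDGET_NS  100000UL /* assumed 100 us budget between reschedules */

/* Hypothetical: the row just before the one currently being estimated. */
static int row_before(int est_row)
{
	return (est_row + EST_NTICKS - 1) % EST_NTICKS;
}

/* Hypothetical: ticks until a walker at est_row, moving one row per tick
 * in increasing order with wrap-around, reaches the given row. */
static int ticks_until(int est_row, int row)
{
	return (row - est_row + EST_NTICKS) % EST_NTICKS;
}

/* Delay before an estimator added "just before the estimated row" is
 * processed for the first time, in this toy model. */
static int first_estimation_delay_ms(int est_row)
{
	return ticks_until(est_row, row_before(est_row)) * EST_TICK_MS;
}

/* Hypothetical: how many estimators fit in one chain if processing one
 * estimator costs ns_per_est nanoseconds; never less than 1. */
static unsigned long chain_max_for(unsigned long ns_per_est)
{
	unsigned long max;

	if (ns_per_est == 0)
		ns_per_est = 1;	/* avoid division by zero in the model */
	max = EST_BUDGET_NS / ns_per_est;
	return max ? max : 1;
}
```

In this model an estimator added while row 7 is being estimated lands in
row 6 and is first processed 49 ticks later, which is close to the 2-second
target. A measured cost of 500 ns per estimator would allow 200 estimators
per chain, and a very slow box would still get a chain limit of 1.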

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use workqueue infrastructure
because:

- The estimation can take a long time when using multiple IPVS rules (e.g.
  millions of estimator structures) and especially when the box has multiple
  CPUs due to the for_each_possible_cpu usage that expects packets from
  any CPU. With est_nice sysctl we have more control over how to prioritize
  the estimation kthreads compared to other processes/kthreads that have
  latency requirements (such as servers). As a benefit, we can see these
  kthreads in top and decide if we will need some further control to limit
  their CPU usage (max number of structures to estimate per kthread).

- with kthreads we run code that is read-mostly, with no write/lock
  operations needed to process the estimators in 2-second intervals.

- work items are one-shot: as estimators are processed every
  2 seconds, they need to be re-added every time. This again
  loads the timers (add_timer) if we use delayed works, as there are
  no kthreads to do the timings.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12 14:45:36 -08:00
..
caif tty: cumulate and document tty_struct::flow* members 2021-05-13 16:57:16 +02:00
device_drivers net/mlx5: E-Switch, Implement devlink port function cmds to control migratable 2022-12-07 20:09:18 -08:00
devlink Documentation: devlink: add devlink documentation for the etas_es58x driver 2022-12-12 11:39:13 +01:00
dsa docs: net: dsa: update information about multiple CPU ports 2022-09-20 10:32:36 +02:00
mac80211_hwsim docs: net: convert two README files to ReST format 2019-07-31 13:31:56 -06:00
6lowpan.rst docs: networking: convert 6lowpan.txt to ReST 2020-02-28 14:52:36 +01:00
6pack.rst docs: networking: convert 6pack.txt to ReST 2020-04-28 14:38:38 -07:00
af_xdp.rst doc, af_xdp: Fix bind flags option typo 2021-07-12 16:55:01 +02:00
alias.rst
arcnet-hardware.rst docs: networking: arcnet-hardware.rst: don't duplicate chapter names 2020-05-01 12:24:43 -07:00
arcnet.rst Documentation: networking: arcnet: drop doubled word 2020-07-04 17:46:21 -07:00
atm.rst docs: networking: convert atm.txt to ReST 2020-04-28 14:38:38 -07:00
ax25.rst Documentation: networking: ax25: drop doubled word 2020-07-04 17:46:21 -07:00
bareudp.rst Documentation: bareudp: Corrected description of bareudp module. 2020-07-28 17:53:03 -07:00
batman-adv.rst batman-adv: Move IRC channel to hackint.org 2021-08-08 20:05:46 +02:00
bonding.rst Documentation: bonding: correct xmit hash steps 2022-12-02 10:46:45 +00:00
bridge.rst
can_ucan_protocol.rst Documentation: networking: can_ucan_protocol: drop doubled words 2020-07-04 17:46:21 -07:00
can.rst can: add termination resistor documentation 2022-10-19 21:33:29 +02:00
cdc_mbim.rst docs: networking: convert cdc_mbim.txt to ReST 2020-04-28 14:38:39 -07:00
checksum-offloads.rst docs: networking: convert netdev-features.txt to ReST 2020-04-30 12:56:36 -07:00
dccp.rst net: dccp: Add SIOCOUTQ IOCTL support (send buffer fill) 2020-07-22 17:00:37 -07:00
dctcp.rst docs: networking: convert dctcp.txt to ReST 2020-04-28 14:38:39 -07:00
dns_resolver.rst docs: networking: convert dns_resolver.txt to ReST 2020-04-28 14:39:46 -07:00
driver.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
eql.rst docs: networking: convert eql.txt to ReST 2020-04-28 14:39:46 -07:00
ethtool-netlink.rst ethtool: add netlink based get rss support 2022-12-05 17:25:00 -08:00
failover.rst
fib_trie.rst docs: networking: convert fib_trie.txt to ReST 2020-04-28 14:39:46 -07:00
filter.rst treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
gen_stats.rst docs: networking: convert gen_stats.txt to ReST 2020-04-28 14:39:46 -07:00
generic_netlink.rst Documentation: networking: Update generic_netlink_howto URL 2022-11-23 17:25:02 -08:00
generic-hdlc.rst docs: networking: convert generic-hdlc.txt to ReST 2020-04-28 14:39:46 -07:00
gtp.rst docs: networking: convert gtp.txt to ReST 2020-04-28 14:39:46 -07:00
ieee802154.rst docs: net: ieee802154.rst: fix C expressions 2020-10-15 07:49:41 +02:00
ila.rst docs: networking: convert ila.txt to ReST 2020-04-28 14:39:47 -07:00
index.rst Documentation: networking: TC queue based filtering 2022-10-25 10:32:40 +02:00
ioam6-sysctl.rst ipv6: ioam: Documentation for new IOAM sysctls 2021-07-21 08:14:33 -07:00
ip_dynaddr.rst docs: networking: convert ip_dynaddr.txt to ReST 2020-04-28 14:39:47 -07:00
ip-sysctl.rst sctp: add sysctl net.sctp.l3mdev_accept 2022-11-18 11:42:54 +00:00
ipddp.rst docs: networking: convert ipddp.txt to ReST 2020-04-28 14:39:47 -07:00
ipsec.rst docs: networking: convert ipsec.txt to ReST 2020-04-28 14:39:47 -07:00
ipv6.rst docs: networking: convert ipv6.txt to ReST 2020-04-28 14:40:18 -07:00
ipvlan.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
ipvs-sysctl.rst ipvs: run_estimation should control the kthread tasks 2022-12-10 22:44:43 +01:00
j1939.rst can: j1939: add tables for the CAN identifier and its fields 2020-11-20 09:43:29 +01:00
kapi.rst wimax: move out to staging 2020-10-29 19:27:45 +01:00
kcm.rst docs: networking: convert kcm.txt to ReST 2020-04-28 14:40:19 -07:00
l2tp.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
lapb-module.rst docs: networking: convert lapb-module.txt to ReST 2020-04-30 12:56:35 -07:00
mac80211-auth-assoc-deauth.txt
mac80211-injection.rst doc: networking: wireless: fix wiki website url 2020-06-08 10:05:53 +02:00
mctp.rst mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control 2022-02-09 12:00:11 +00:00
mpls-sysctl.rst docs: networking: convert mpls-sysctl.txt to ReST 2020-04-30 12:56:36 -07:00
mptcp-sysctl.rst Documentation: mptcp: fix pm_type formatting 2022-09-13 10:18:44 +02:00
msg_zerocopy.rst docs: use the lore redirector everywhere 2021-10-12 13:58:19 -06:00
multiqueue.rst docs: networking: convert multiqueue.txt to ReST 2020-04-30 12:56:36 -07:00
net_dim.rst docs: networking: add full DIM API 2020-04-10 18:11:04 -07:00
net_failover.rst Documentation: networking: net_failover: Fix documentation 2021-11-17 13:59:49 +00:00
netconsole.rst docs: networking: convert netconsole.txt to ReST 2020-04-30 12:56:36 -07:00
netdev-features.rst net: hsr: add offloading support 2021-02-11 13:24:44 -08:00
netdevices.rst net: bonding: move ioctl handling to private ndo operation 2021-07-27 20:11:45 +01:00
netfilter-sysctl.rst docs: networking: convert netfilter-sysctl.txt to ReST 2020-04-30 12:56:36 -07:00
netif-msg.rst docs: networking: convert netif-msg.txt to ReST 2020-04-30 12:56:36 -07:00
nexthop-group-resilient.rst Documentation: net: Document resilient next-hop groups 2021-03-29 13:51:38 -07:00
nf_conntrack-sysctl.rst netfilter: conntrack: remove nf_conntrack_helper documentation 2022-09-20 23:50:03 +02:00
nf_flowtable.rst docs: nf_flowtable: fix compilation and warnings 2021-03-25 17:42:02 -07:00
nfc.rst docs: networking: nfc: change to rst format 2019-11-23 11:00:19 -08:00
openvswitch.rst docs: networking: convert openvswitch.txt to ReST 2020-04-30 12:56:36 -07:00
operstates.rst docs: operstates: document IF_OPER_TESTING 2021-08-02 15:16:04 +01:00
packet_mmap.rst docs: networking: Replace strncpy() with strscpy() 2021-06-04 11:21:43 -06:00
page_pool.rst Documentation: update networking/page_pool.rst 2022-03-03 09:55:28 +00:00
phonet.rst docs: networking: convert phonet.txt to ReST 2020-04-30 12:56:37 -07:00
phy.rst docs: networking: phy: add missing space 2022-10-05 20:32:39 -07:00
pktgen.rst pktgen: document the latest pktgen usage options 2021-08-25 13:44:30 +01:00
plip.rst docs: networking: convert PLIP.txt to ReST 2020-04-30 12:56:37 -07:00
ppp_generic.rst docs: update ppp_generic.rst to document new ioctls 2020-12-10 13:57:36 -08:00
proc_net_tcp.rst docs: networking: convert proc_net_tcp.txt to ReST 2020-04-30 12:56:37 -07:00
radiotap-headers.rst docs: networking: convert radiotap-headers.txt to ReST 2020-04-30 12:56:37 -07:00
rds.rst Doc: networking: Fix the title's Sphinx overline in rds.rst 2021-11-29 15:18:21 -07:00
regulatory.rst doc: networking: wireless: fix wiki website url 2020-06-08 10:05:53 +02:00
representors.rst docs: net: add an explanation of VF (and other) Representors 2022-09-21 07:31:38 -07:00
rxrpc.rst rxrpc: Remove rxrpc_get_reply_time() which is no longer used 2022-09-01 11:44:13 +01:00
scaling.rst docs: networking: update XPS to account for netif_set_xps_queue 2020-10-13 16:21:54 -07:00
sctp.rst docs: networking: convert sctp.txt to ReST 2020-04-30 12:56:38 -07:00
secid.rst docs: networking: convert secid.txt to ReST 2020-04-30 12:56:38 -07:00
seg6-sysctl.rst doc: move seg6_flowlabel to seg6-sysctl.rst 2021-04-14 13:13:15 -07:00
segmentation-offloads.rst networking: : fix typos in code comments 2019-05-20 20:24:34 -04:00
sfp-phylink.rst doc: sfp-phylink: Fix a broken reference 2022-08-02 21:45:07 -07:00
skbuff.rst skbuff: render the checksum comment to documentation 2022-05-10 17:48:37 -07:00
smc-sysctl.rst net/smc: Unbind r/w buffer size from clcsock and make them tunable 2022-09-22 12:58:21 +02:00
snmp_counter.rst net-next: docs: Fix typos in snmp_counter.rst 2021-01-05 17:07:38 -08:00
statistics.rst docs: networking: extend the statistics documentation 2021-04-16 16:59:20 -07:00
strparser.rst docs: networking: convert strparser.txt to ReST 2020-04-30 12:56:38 -07:00
switchdev.rst docs: net: add an explanation of VF (and other) Representors 2022-09-21 07:31:38 -07:00
sysfs-tagging.rst Documentation: better locations for sysfs-pci, sysfs-tagging 2020-10-09 09:33:23 -06:00
tc-actions-env-rules.rst docs: networking: convert tc-actions-env-rules.txt to ReST 2020-04-30 12:56:38 -07:00
tc-queue-filters.rst Documentation: networking: TC queue based filtering 2022-10-25 10:32:40 +02:00
tcp-thin.rst docs: networking: convert tcp-thin.txt to ReST 2020-04-30 12:56:38 -07:00
team.rst docs: networking: convert team.txt to ReST 2020-04-30 12:56:38 -07:00
timestamping.rst net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP 2022-12-08 19:49:21 -08:00
tipc.rst Documentation: add more details in tipc.rst 2021-07-01 13:18:18 -07:00
tls-offload-layers.svg Documentation: add TLS offload documentation 2019-05-22 12:18:20 -07:00
tls-offload-reorder-bad.svg Documentation: add TLS offload documentation 2019-05-22 12:18:20 -07:00
tls-offload-reorder-good.svg Documentation: add TLS offload documentation 2019-05-22 12:18:20 -07:00
tls-offload.rst net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled 2021-01-19 15:58:05 -08:00
tls.rst tls: rx: add counter for NoPad violations 2022-07-11 19:48:33 -07:00
tproxy.rst docs: networking: convert tproxy.txt to ReST 2020-04-30 12:56:38 -07:00
tuntap.rst docs: networking: Replace strncpy() with strscpy() 2021-06-04 11:21:43 -06:00
udplite.rst docs: networking: convert udplite.txt to ReST 2020-05-01 12:24:40 -07:00
vrf.rst doc: Document unexpected tcp_l3mdev_accept=1 behavior 2021-08-23 11:53:24 +01:00
vxlan.rst docs: vxlan: add info about device features 2020-09-28 12:50:12 -07:00
x25-iface.rst net: x25: Queue received packets in the drivers instead of per-CPU queues 2021-04-05 11:42:12 -07:00
x25.rst net: x25: Remove unimplemented X.25-over-LLC code stubs 2020-12-12 17:15:33 -08:00
xfrm_device.rst xfrm: document IPsec packet offload mode 2022-12-05 10:40:29 +01:00
xfrm_proc.rst docs: networking: convert xfrm_proc.txt to ReST 2020-05-01 12:24:40 -07:00
xfrm_sync.rst docs: networking: convert xfrm_sync.txt to ReST 2020-05-01 12:24:41 -07:00
xfrm_sysctl.rst docs: networking: convert xfrm_sysctl.txt to ReST 2020-05-01 12:24:41 -07:00