linux/net/bridge
Jakub Kicinski 95d1815f09 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
   from Li Qiong.

7) A large batch of IPVS updates to replace timer-based estimators by
   kthreads to scale up wrt. CPUs and workload (millions of estimators).

Julian Anastasov says:

	This patchset implements stats estimation in kthread context.
It replaces the code that runs on single CPU in timer context every 2
seconds and causing latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

	Spread the estimation on multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules.  When stats are not needed, it is recommended to use
run_estimation=0 as already implemented before this change.

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest which
hold estimator structures are now always freed from RCU
callback. This ensures RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works over its own data structure and all
such structures are attached to array. For now we limit
kthreads depending on the number of CPUs.

- even while there can be a kthread structure, its task
may not be running, eg. before first service is added or
while the sysctl var is set to an empty cpulist or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks which saves memory for
setups with small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds.

- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added in
temp list (est_temp_list) and later kthread 0
distributes them to other kthreads. The optimization
is based on the fact that newly added estimator
should be estimated after 2 seconds, so we have the
time to offload the adding to chain from controlling
process to kthread 0.

- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, common case if all
services and dests are initially configured, we may
spread the estimators to more chains and as result,
reducing the initial delay below 2 seconds.

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use workqueue infrastructure
because:

- The estimation can take long time when using multiple IPVS rules (eg.
  millions estimator structures) and especially when box has multiple
  CPUs due to the for_each_possible_cpu usage that expects packets from
  any CPU. With est_nice sysctl we have more control how to prioritize the
  estimation kthreads compared to other processes/kthreads that have
  latency requirements (such as servers). As a benefit, we can see these
  kthreads in top and decide if we will need some further control to limit
  their CPU usage (max number of structure to estimate per kthread).

- with kthreads we run code that is read-mostly, no write/lock
  operations to process the estimators in 2-second intervals.

- work items are one-shot: as estimators are processed every
  2 seconds, they need to be re-added every time. This again
  loads the timers (add_timer) if we use delayed works, as there are
  no kthreads to do the timings.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12 14:45:36 -08:00
..
netfilter netfilter: conntrack: merge ipv4+ipv6 confirm functions 2022-11-30 18:55:30 +01:00
br_arp_nd_proxy.c net: bridge: Use netif_rx(). 2022-03-04 12:02:19 +00:00
br_cfm_netlink.c bridge: cfm: Netlink Notifications. 2020-10-29 18:39:44 -07:00
br_cfm.c bridge: cfm: remove redundant return 2021-06-22 10:35:15 -07:00
br_device.c bridge: move from strlcpy with unused retval to strscpy 2022-08-22 17:57:30 -07:00
br_fdb.c bridge: switchdev: Allow device drivers to install locked FDB entries 2022-11-09 19:06:13 -08:00
br_forward.c net: Add skb_clear_tstamp() to keep the mono delivery_time 2022-03-03 14:38:48 +00:00
br_if.c net: bridge: assign path_cost for 2.5G and 5G link speed 2022-09-30 12:35:29 +01:00
br_input.c bridge: Add missing parentheses 2022-11-11 21:34:55 -08:00
br_ioctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2021-12-31 14:35:40 +00:00
br_mdb.c bridge: mcast: Remove redundant function arguments 2022-12-07 20:05:52 -08:00
br_mrp_netlink.c bridge: mrp: Use hlist_head instead of list_head for mrp 2020-11-09 16:42:12 -08:00
br_mrp_switchdev.c bridge: mrp: Extend br_mrp_switchdev to detect better the errors 2021-02-16 14:47:46 -08:00
br_mrp.c net: bridge: mrp: Update the Test frames for MRA 2021-06-28 15:46:10 -07:00
br_mst.c net: bridge: mst: Add helper to query a port's MST state 2022-03-17 16:49:58 -07:00
br_multicast_eht.c net: bridge: multicast: use multicast contexts instead of bridge or port 2021-07-20 05:41:19 -07:00
br_multicast.c bridge: mcast: Constify 'group' argument in br_multicast_new_port_group() 2022-12-07 20:05:52 -08:00
br_netfilter_hooks.c netfilter: br_netfilter: Drop dst references before setting. 2022-08-31 12:12:32 +02:00
br_netfilter_ipv6.c netfilter: br_netfilter: Drop dst references before setting. 2022-08-31 12:12:32 +02:00
br_netlink_tunnel.c net: bridge: notify on vlan tunnel changes done via the old api 2020-07-12 15:18:24 -07:00
br_netlink.c bridge: Add MAC Authentication Bypass (MAB) support 2022-11-03 20:46:32 -07:00
br_nf_core.c net: add bool confirm_neigh parameter for dst_ops.update_pmtu 2019-12-24 22:28:54 -08:00
br_private_cfm.h bridge: cfm: Kernel space implementation of CFM. CCM frame RX added. 2020-10-29 18:39:43 -07:00
br_private_mcast_eht.h net: bridge: multicast: use multicast contexts instead of bridge or port 2021-07-20 05:41:19 -07:00
br_private_mrp.h net: bridge: mrp: Update the Test frames for MRA 2021-06-28 15:46:10 -07:00
br_private_stp.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
br_private_tunnel.h net: bridge: change return type of br_handle_ingress_vlan_tunnel 2021-08-24 16:51:09 -07:00
br_private.h bridge: mcast: Constify 'group' argument in br_multicast_new_port_group() 2022-12-07 20:05:52 -08:00
br_stp_bpdu.c net: bridge: add STP xstats 2019-12-14 20:02:36 -08:00
br_stp_if.c net: use eth_hw_addr_set() 2021-10-02 14:18:25 +01:00
br_stp_timer.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
br_stp.c net: bridge: mst: Multiple Spanning Tree (MST) mode 2022-03-17 16:49:57 -07:00
br_switchdev.c bridge: switchdev: Reflect MAB bridge port flag to device drivers 2022-11-09 19:06:14 -08:00
br_sysfs_br.c bridge: Fix flushing of dynamic FDB entries 2022-11-02 20:47:09 -07:00
br_sysfs_if.c bridge: move from strlcpy with unused retval to strscpy 2022-08-22 17:57:30 -07:00
br_vlan_options.c net: bridge: mst: Allow changing a VLAN's MSTI 2022-03-17 16:49:57 -07:00
br_vlan_tunnel.c net: bridge: change return type of br_handle_ingress_vlan_tunnel 2021-08-24 16:51:09 -07:00
br_vlan.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-11-17 18:30:39 -08:00
br.c bridge: switchdev: Allow device drivers to install locked FDB entries 2022-11-09 19:06:13 -08:00
Kconfig bridge: cfm: Add BRIDGE_CFM to Kconfig. 2020-10-29 18:39:43 -07:00
Makefile net: bridge: mst: Multiple Spanning Tree (MST) mode 2022-03-17 16:49:57 -07:00