While looking at UDP receive performance, I saw sk_wake_async()
was no longer inlined.
This matters at least on AMD Zen1-4 platforms (see SRSO)
This might be because rcu_read_lock() and rcu_read_unlock()
are no longer nops in recent kernels?
Add sk_wake_async_rcu() variant, which must be called from
contexts already holding rcu lock.
As SOCK_FASYNC is deprecated these days, use unlikely()
to give the compiler a hint.
sk_wake_async_rcu() is properly inlined from
__udp_enqueue_schedule_skb() and sock_def_readable().
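A minimal sketch of such a helper (assuming the caller is already
inside an RCU read-side critical section; modelled on the existing
sk_wake_async()):

  /* Like sk_wake_async(), but the caller must hold rcu_read_lock(). */
  static inline void sk_wake_async_rcu(const struct sock *sk,
                                       int how, int band)
  {
          if (unlikely(sock_flag(sk, SOCK_FASYNC)))
                  sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
  }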
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock_def_readable() is quite expensive (particularly
when ep_poll_callback() is in the picture).
We must call sk->sk_data_ready() when:
- receive queue was empty, or
- SO_PEEK_OFF is enabled on the socket, or
- sk->sk_data_ready is not sock_def_readable.
We still need to call sk_wake_async().
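A hedged sketch of the resulting wakeup logic (where 'up' is
udp_sk(sk); 'was_empty' and 'peeking_with_offset' are illustrative
names, not necessarily the exact upstream diff):

  if (!sock_flag(sk, SOCK_DEAD)) {
          if (was_empty ||                          /* queue was empty */
              READ_ONCE(up->peeking_with_offset) || /* SO_PEEK_OFF on  */
              sk->sk_data_ready != sock_def_readable)
                  sk->sk_data_ready(sk);
          else    /* skip sock_def_readable(), keep async notification */
                  sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
  }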
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_rcvbuf is read locklessly twice, while other threads
could change its value.
Use a READ_ONCE() to annotate the race.
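The pattern, sketched on the __udp_enqueue_schedule_skb() fast path
(one annotated read into a local, reused for both checks):

  rcvbuf = READ_ONCE(sk->sk_rcvbuf);      /* lockless, racy read */
  rmem = atomic_read(&sk->sk_rmem_alloc);
  if (rmem > rcvbuf)
          goto drop;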
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arnd Bergmann says:
====================
address remaining -Wtautological-constant-out-of-range-compare
The warning option was introduced a few years ago but left disabled
by default. All of the actual bugs that this has found have been
fixed in the meantime, and this series should address the remaining
false-positives, as tested on arm/arm64/x86 randconfigs as well as
allmodconfig builds for all architectures supported by clang.
====================
Link: https://lore.kernel.org/r/20240328143051.1069575-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building with 64KB pages, clang points out that xsk->chunk_size
can never be PAGE_SIZE:
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c:19:22: error: result of comparison of constant 65536 with expression of type 'u16' (aka 'unsigned short') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
if (xsk->chunk_size > PAGE_SIZE ||
    ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~
In older versions of this code, using PAGE_SIZE was the only
possibility, so this would have never worked on 64KB page kernels,
but the patch apparently did not address this case completely.
As Maxim Mikityanskiy suggested, 64KB chunks are really not all that
useful, so just shut up the warning by adding a cast.
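The fix, roughly (the lower bound is assumed from the surrounding
code):

  /* Widen the u16 so the upper bound is meaningful on 64KB-page
   * kernels instead of tautologically false. */
  if ((size_t)xsk->chunk_size > PAGE_SIZE ||
      xsk->chunk_size < MLX5E_MIN_XSK_CHUNK_SIZE)
          return false;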
Fixes: 282c0c798f ("net/mlx5e: Allow XSK frames smaller than a page")
Link: https://lore.kernel.org/netdev/20211013150232.2942146-1-arnd@kernel.org/
Link: https://lore.kernel.org/lkml/a7b27541-0ebb-4f2d-bd06-270a4d404613@app.fastmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240328143051.1069575-9-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add description of mdio enable, mdio disable and mdio wait functions.
Add description of skb pointer in axidma_bd data structure.
Remove 'phy_node' description in axienet local data structure since
it is not a valid struct member.
Correct description of struct axienet_option.
Fix below kernel-doc warnings in drivers/net/ethernet/xilinx/:
1) xilinx_axienet_mdio.c:1: warning: no structured comments found
2) xilinx_axienet.h:379: warning: Function parameter or struct member
'skb' not described in 'axidma_bd'
3) xilinx_axienet.h:538: warning: Excess struct member 'phy_node'
description in 'axienet_local'
4) xilinx_axienet.h:1002: warning: expecting prototype for struct
axiethernet_option. Prototype was for struct axienet_option instead
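For instance, warning 2) is resolved by a kernel-doc member line of
this shape (wording illustrative):

  /**
   * struct axidma_bd - Axi DMA buffer descriptor
   * ...
   * @skb: Pointer to the sk_buff mapped to this buffer descriptor
   */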
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/20240328110713.12885-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clang static checker (scan-build):
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:503:2: warning:
Value stored to 'rsp_hdr' is never read [deadcode.DeadStores]
Remove these unused variables to save some space.
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240328020723.4071539-1-suhui@nfschina.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core in mhi_driver_register() already sets the .owner, so driver
does not need to.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Link: https://lore.kernel.org/r/20240327174810.519676-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NLMSGERR_ATTR_POLICY extack attribute has been ignored by ynl up to
now. Extend extack decoding to include _POLICY and the nested
NL_POLICY_TYPE_ATTR_* attributes.
For example:
./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/rt_link.yaml \
--create --do newlink --json '{
"ifname": "12345678901234567890",
"linkinfo": {"kind": "bridge"}
}'
Netlink error: Numerical result out of range
nl_len = 104 (88) nl_flags = 0x300 nl_type = 2
error: -34 extack: {'msg': 'Attribute failed policy validation',
'policy': {'max-length': 15, 'type': 'string'}, 'bad-attr': '.ifname'}
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240328155636.64688-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arnd Bergmann says:
====================
enabled -Wformat-truncation for clang
With randconfig build testing, I found only eight files that produce
warnings with clang when -Wformat-truncation is enabled. This means
we can just turn it on by default rather than only enabling it for
"make W=1".
Unfortunately, gcc produces a lot more warnings when the option
is enabled, so it's not yet possible to turn it on for both compilers.
====================
Link: https://lore.kernel.org/r/20240326223825.4084412-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
clang warns that one error message is too long for its destination buffer:
drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c:1876:4: error: 'snprintf' will always be truncated; specified size is 80, but format string expands to at least 94 [-Werror,-Wformat-truncation-non-kprintf]
Reword it to be a bit shorter so it always fits.
Fixes: 70f0302b3f ("net/mlx5: Bridge, implement mdb offload")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://lore.kernel.org/r/20240326223825.4084412-5-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
clang complains that the temporary string for the name passed into
alloc_workqueue() is too short for its contents:
drivers/net/ethernet/qlogic/qed/qed_main.c:1218:3: error: 'snprintf' will always be truncated; specified size is 16, but format string expands to at least 18 [-Werror,-Wformat-truncation]
There is no need for a temporary buffer, and the actual name of a workqueue
is 32 bytes (WQ_NAME_LEN), so just use the interface as intended to avoid
the truncation.
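A sketch of the intended usage, handing the format string and its
arguments straight to alloc_workqueue() (exact arguments assumed):

  slowpath_wq = alloc_workqueue("slowpath-%02x:%02x.%02x", 0, 0,
                                cdev->pdev->bus->number,
                                PCI_SLOT(cdev->pdev->devfn),
                                PCI_FUNC(cdev->pdev->devfn));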
Fixes: 59ccf86fe6 ("qed: Add driver infrastucture for handling mfw requests.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240326223825.4084412-4-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As clang points out, the error message in enetc_setup_xdp_prog()
still does not fit in the buffer and will be truncated:
drivers/net/ethernet/freescale/enetc/enetc.c:2771:3: error: 'snprintf' will always be truncated; specified size is 80, but format string expands to at least 87 [-Werror,-Wformat-truncation]
Replace it with an even shorter message that should fit.
Fixes: f968c56417 ("net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240326223825.4084412-3-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Any NIX interface type can have a maximum of 256 channels, so
increase the backpressure ID count to 256 so that it covers the cn9k
and cn10k SoCs, which have different NIX interface types with varying
maximum channel counts.
Signed-off-by: Radha Mohan Chintakuntla <radhac@marvell.com>
Link: https://lore.kernel.org/r/20240326184514.1628284-1-radhac@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Balazs Scheidler says:
====================
Add IP/port information to UDP drop tracepoint
In our use-case we would like to recover the properties of dropped UDP
packets. Unfortunately the current udp_fail_queue_rcv_skb tracepoint
only exposes the port number of the receiving socket.
This patch-set will add the source/dest ip/port to the tracepoint, while
keeping the socket's local port as well for compatibility.
Thanks for the review comments by Jason and Kuniyuki, they helped me a lot
and I tried to address all of their comments in this new iteration.
====================
Link: https://lore.kernel.org/r/cover.1711475011.git.balazs.scheidler@axoflow.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes
the TCP specific implementation details.
Previously the macro assumed the skb passed as an argument is a
TCP packet; the implementation now takes a pointer to the L4 header
as an argument and uses that to extract the source/destination
ports, which happen to be named the same in "struct tcphdr" and
"struct udphdr".
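Since both headers name their port fields identically, a
protocol-agnostic extraction can be sketched as (macro and entry
names illustrative):

  #define TP_STORE_PORTS(__entry, l4h)                    \
          do {                                            \
                  __entry->sport = ntohs((l4h)->source);  \
                  __entry->dport = ntohs((l4h)->dest);    \
          } while (0)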
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Link: https://lore.kernel.org/r/9e306f78260dfbbdc7353ba5f864cc027a409540.1711475011.git.balazs.scheidler@axoflow.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This updates the driver to gpiod API, and removes yet another use of
of_get_named_gpio().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240326175836.1418718-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Woudstra says:
====================
Add en8811h phy driver and devicetree binding doc
This patch series adds the driver and the devicetree binding documentation
for the Airoha en8811h PHY.
====================
Link: https://lore.kernel.org/r/20240326162305.303598-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the driver for the Airoha EN8811H 2.5 Gigabit PHY. The PHY
supports 100/1000/2500 Mbps, with autonegotiation only.
The driver uses two firmware files, for which updated versions are added to
linux-firmware already.
Note: At phy-address + 8 there is another device on the mdio bus
that belongs to the EN8811H. While the original driver writes to it,
Airoha has confirmed this is not needed. Therefore, communication
with this device is not included in this driver.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326162305.303598-3-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the Airoha EN8811H 2.5 Gigabit PHY.
The EN8811H PHY can be configured with SerDes polarity reversed on
RX and/or TX.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326162305.303598-2-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima says:
====================
af_unix: Rework GC.
When we pass a file descriptor to an AF_UNIX socket via SCM_RIGHTS,
the underlying struct file of the inflight fd gets its refcount bumped.
If the fd is of an AF_UNIX socket, we need to track it in case it forms
cyclic references.
Let's say we send a fd of AF_UNIX socket A to B and vice versa and
close() both sockets.
When created, each socket's struct file initially has one reference.
After the fd exchange, both refcounts are bumped up to 2. Then, close()
decreases both to 1. From this point on, no one can touch the file/socket.
However, the struct file has one refcount and thus never calls the
release() function of the AF_UNIX socket.
That's why we need to track all inflight AF_UNIX sockets and run garbage
collection.
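For illustration, a minimal userspace program that creates such an
unreachable cycle (the self-referencing A <-> A case):

  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static void send_fd(int via, int fd)
  {
          char buf[CMSG_SPACE(sizeof(fd))] = {0};
          struct iovec iov = { .iov_base = "x", .iov_len = 1 };
          struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = buf, .msg_controllen = sizeof(buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
          memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
          sendmsg(via, &msg, 0);
  }

  int main(void)
  {
          int sv[2];

          socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
          send_fd(sv[1], sv[0]);  /* sv[0] now sits in its own queue */
          close(sv[0]);           /* refcount stays up: only the GC  */
          close(sv[1]);           /* can reclaim this socket         */
          return 0;
  }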
This series replaces the current GC implementation that locks each inflight
socket's receive queue and requires trickiness in other places.
The new GC does not lock each socket's queue to minimise its effect and
tries to be lightweight if there is no cyclic reference or no update in
the shape of the inflight fd graph.
The new implementation is based on Tarjan's Strongly Connected Components
algorithm, and we will consider each inflight AF_UNIX socket as a vertex
and its file descriptor as an edge in a directed graph.
For the details, please see each patch.
patch 1 - 3  : Add struct to express inflight socket graphs
patch 4      : Optimise inflight fd counting
patch 5 - 6  : Group SCC possibly forming a cycle
patch 7 - 8  : Support embryo socket
patch 9 - 11 : Make GC lightweight
patch 12 - 13: Detect dead cycle references
patch 14     : Replace GC algorithm
patch 15     : selftest
After this series is applied, we can remove the two ugly tricks for
races, scm_fp_dup() in unix_attach_fds() and the spin_lock dance in
unix_peek_fds(), as done in patch 14/15 of v1.
Also, we will add cond_resched_lock() in __unix_gc() and convert it to
use a dedicated kthread instead of global system workqueue as suggested
by Paolo in a v4 thread.
v4: https://lore.kernel.org/netdev/20240301022243.73908-1-kuniyu@amazon.com/
v3: https://lore.kernel.org/netdev/20240223214003.17369-1-kuniyu@amazon.com/
v2: https://lore.kernel.org/netdev/20240216210556.65913-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/netdev/20240203030058.60750-1-kuniyu@amazon.com/
====================
Link: https://lore.kernel.org/r/20240325202425.60930-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds test cases to verify the new GC.
We run each test for the following cases:
* SOCK_DGRAM
* SOCK_STREAM without embryo socket
* SOCK_STREAM without embryo socket + MSG_OOB
* SOCK_STREAM with embryo sockets
* SOCK_STREAM with embryo sockets + MSG_OOB
Before and after running each test case, we ensure that there is
no AF_UNIX socket left in the netns by reading /proc/net/protocols.
We cannot use /proc/net/unix and UNIX_DIAG because the embryo socket
does not show up there.
Each test creates multiple sockets in an array. We pass sockets in
the even index using the peer sockets in the odd index.
So, send_fd(0, 1) actually sends fd[0] to fd[2] via fd[0 + 1].
Test 1 : A <-> A
Test 2 : A <-> B
Test 3 : A -> B -> C <- D
         ^.___|___.'    ^
         `---------'
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-16-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If we find a dead SCC during iteration, we call unix_collect_skb()
to splice all skbs in the SCC onto the global sk_buff_head, hitlist.
After iterating all SCCs, we unlock unix_gc_lock and purge the queue.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-15-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When iterating an SCC, we call unix_vertex_dead() for each vertex
to check if the vertex is close()d and has no bridge to another
SCC.
If both conditions are true for every vertex in the SCC, we can
execute garbage collection for all skbs in the SCC.
The actual garbage collection is done in the following patch,
replacing the old implementation.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-14-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The definition of the lowlink in Tarjan's algorithm is the
smallest index of a vertex that is reachable with at most one
back-edge in SCC. This is not useful for a cross-edge.
If we start traversing from A in the following graph, the final
lowlink of D is 3. The cross-edge here is one between D and C.
A -> B -> D   D = (4, 3)  (index, lowlink)
^    |    |   C = (3, 1)
|    V    |   B = (2, 1)
`--- C <--'   A = (1, 1)
This is because the lowlink of D is updated with the index of C.
In the following patch, we detect a dead SCC by checking two
conditions for each vertex.
1) vertex has no edge directed to another SCC (no bridge)
2) vertex's out_degree is the same as the refcount of its file
If 1) is false, there is a receiver of all fds of the SCC and
its ancestor SCC.
To evaluate 1), we need to assign a unique index to each SCC and
assign it to all vertices in the SCC.
This patch changes the lowlink update logic for cross-edge so
that in the example above, the lowlink of D is updated with the
lowlink of C.
A -> B -> D   D = (4, 1)  (index, lowlink)
^    |    |   C = (3, 1)
|    V    |   B = (2, 1)
`--- C <--'   A = (1, 1)
Then, all vertices in the same SCC have the same lowlink, and we
can quickly find the bridge connecting to a different SCC, if one
exists.
However, it is no longer called lowlink, so we rename it to
scc_index. (It's sometimes called lowpoint.)
Also, we add a global variable to hold the last index used in DFS
so that we do not reset the initial index in each DFS.
This patch could be squashed into the SCC detection patch but is
split out deliberately for anyone wondering why lowlink is not used
as in the original Tarjan's algorithm and many reference
implementations.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-13-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once a cyclic reference is formed, we need to run GC to check if
there is dead SCC.
However, we do not need to run Tarjan's algorithm if we know that
the shape of the inflight graph has not been changed.
If an edge is added/updated/deleted and the edge's successor is
inflight, we set unix_graph_grouped to false, which means we need
to re-classify the SCC.
Once we finalise the SCC, we set unix_graph_grouped to true.
While unix_graph_grouped is true, we can iterate the grouped
SCC using vertex->scc_entry in unix_walk_scc_fast().
The list_add() and list_for_each_entry_reverse() uses may seem odd,
but they keep the vertex order consistent and make writing tests
easier.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We do not need to run GC if there is no possible cyclic reference.
We use unix_graph_maybe_cyclic to decide if we should run GC.
If a fd of an AF_UNIX socket is passed to an already inflight AF_UNIX
socket, they could form a cyclic reference. Then, we set
unix_graph_maybe_cyclic to true and later run Tarjan's algorithm to
group them into SCCs.
Once we run Tarjan's algorithm, we are 100% sure whether cyclic
references exist or not. If there is no cycle, we set
unix_graph_maybe_cyclic to false and can skip the entire garbage
collection next time.
When finalising the SCC, we set unix_graph_maybe_cyclic to true if
the SCC consists of multiple vertices.
Even if the SCC is a single vertex, a cycle might still exist in the
form of self-fd passing. Given that this corner case is rare, we
detect it by checking all edges of the vertex and set
unix_graph_maybe_cyclic to true accordingly.
With this change, __unix_gc() is just a spin_lock() dance in the normal
usage.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before starting Tarjan's algorithm, we need to mark all vertices
as unvisited. We can save this O(n) setup by reserving two special
indices (0, 1) and using two variables.
The first time we link a vertex to unix_unvisited_vertices, we set
unix_vertex_unvisited_index to index.
During DFS, we can see that the index of unvisited vertices is the
same as unix_vertex_unvisited_index.
When we finalise SCC later, we set unix_vertex_grouped_index to each
vertex's index.
Then, we can know (i) that the vertex is on the stack if the index
of a visited vertex is >= 2 and (ii) that it is not on the stack and
belongs to a different SCC if the index is unix_vertex_grouped_index.
After the whole algorithm, all indices of vertices are set as
unix_vertex_grouped_index.
Next time we start DFS, we know that all unvisited vertices have
unix_vertex_grouped_index, and we can use unix_vertex_unvisited_index
as the not-on-stack marker.
To use the same variable in __unix_walk_scc(), we can swap
unix_vertex_(grouped|unvisited)_index at the end of Tarjan's
algorithm.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To garbage collect inflight AF_UNIX sockets, we must define the
cyclic reference appropriately. This is a bit tricky if the loop
consists of embryo sockets.
Suppose that the fd of AF_UNIX socket A is passed to D and the fd B
to C and that C and D are embryo sockets of A and B, respectively.
It may appear that there are two separate graphs, A (-> D) and
B (-> C), but this is not correct.
A --. .-- B
     X
C <-' `-> D
Now, D holds A's refcount, and C has B's refcount, so unix_release()
will never be called for A and B when we close() them. However, no
one can call close() for D and C to free skbs holding refcounts of A
and B because C/D is in A/B's receive queue, which should have been
purged by unix_release() for A and B.
So, here's another type of cyclic reference. When a fd of an AF_UNIX
socket is passed to an embryo socket, the reference is indirectly held
by its parent listening socket.
.-> A                            .-> B
|   `- sk_receive_queue          |   `- sk_receive_queue
|      `- skb                    |      `- skb
|         `- sk == C             |         `- sk == D
|            `- sk_receive_queue |            `- sk_receive_queue
|               `- skb +---------'               `- skb +-.
|                                                         |
`---------------------------------------------------------'
Technically, the graph must be denoted as A <-> B instead of A (-> D)
and B (-> C) to find such a cyclic reference without touching each
socket's receive queue.
.-> A --. .-- B <-.
|        X        |  ==  A <-> B
`-- C <-' `-> D --'
We apply this fixup during GC by fetching the real successor via
unix_edge_successor().
When we call accept(), we clear unix_sock.listener under unix_gc_lock
so as not to confuse GC.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a prep patch for the following change, where we need to
fetch the listening socket from the successor embryo socket
during GC.
We add a new field to struct unix_sock to save a pointer to a
listening socket.
We set it when connect() creates a new socket, and clear it when
accept() is called.
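A sketch of the addition (field name and comment assumed):

  struct unix_sock {
          /* ... existing fields ... */
          struct sock *listener;  /* set in connect(), cleared in accept() */
  };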
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the new GC, we use a simple graph algorithm, Tarjan's Strongly
Connected Components (SCC) algorithm, to find cyclic references.
The algorithm visits every vertex exactly once using depth-first
search (DFS).
DFS starts by pushing an input vertex to a stack and assigning it
a unique number. Two fields, index and lowlink, are initialised
with the number, but lowlink could be updated later during DFS.
If a vertex has an edge to an unvisited inflight vertex, we visit
it and do the same processing. So, we will have vertices in the
stack in the order they appear and number them consecutively in
the same order.
If a vertex has a back-edge to a visited vertex in the stack,
we update the predecessor's lowlink with the successor's index.
After iterating edges from the vertex, we check if its index
equals its lowlink.
If the lowlink is different from the index, it shows there was a
back-edge. Then, we go backtracking and propagate the lowlink to
its predecessor and resume the previous edge iteration from the
next edge.
If the lowlink is the same as the index, we pop vertices before
and including the vertex from the stack. Then, the set of vertices
is SCC, possibly forming a cycle. At the same time, we move the
vertices to unix_visited_vertices.
When we finish the algorithm, all vertices in each SCC will be
linked via unix_vertex.scc_entry.
Let's take an example. We have a graph including five inflight
vertices (F is not inflight):
A -> B -> C -> D -> E (-> F)
     ^         |
     `---------'
Suppose that we start DFS from C. We will visit C, D, and B first
and initialise their index and lowlink. Then, the stack looks like
this:
> B = (3, 3) (index, lowlink)
  D = (2, 2)
  C = (1, 1)
When checking B's edge to C, we update B's lowlink with C's index
and propagate it to D.
  B = (3, 1) (index, lowlink)
> D = (2, 1)
  C = (1, 1)
Next, we visit E, which has no edge to an inflight vertex.
> E = (4, 4) (index, lowlink)
  B = (3, 1)
  D = (2, 1)
  C = (1, 1)
When we leave from E, its index and lowlink are the same, so we
pop E from the stack as a single-vertex SCC. Next, we leave from
B and D but do nothing because their lowlinks are different from
their indices.
  B = (3, 1) (index, lowlink)
  D = (2, 1)
> C = (1, 1)
Then, we leave from C, whose index and lowlink are the same, so
we pop B, D and C as SCC.
Last, we do DFS for the rest of vertices, A, which is also a
single-vertex SCC.
Finally, each unix_vertex.scc_entry is linked as follows:
A -.  B -> C -> D  E -.
^  |  ^         |  ^  |
`--'  `---------'  `--'
We use SCC later to decide whether we can garbage-collect the
sockets.
Note that we still cannot detect SCC properly if an edge points
to an embryo socket. The following two patches will sort it out.
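For readers new to the algorithm, a generic self-contained sketch of
Tarjan's SCC detection over the example graph (recursive for brevity,
and starting DFS from A rather than C, so the index numbers differ
from the walkthrough above; the kernel version is iterative and keeps
this state in struct unix_vertex):

  #include <stdio.h>

  enum { N = 6 };                 /* vertices A..F */
  static int adj[N][N], idx[N], low[N], on_stack[N];
  static int stack[N], sp, next_idx = 1;

  static int min(int a, int b) { return a < b ? a : b; }

  static void strongconnect(int v)
  {
          int w;

          idx[v] = low[v] = next_idx++;     /* index and lowlink */
          stack[sp++] = v;
          on_stack[v] = 1;

          for (w = 0; w < N; w++) {
                  if (!adj[v][w])
                          continue;
                  if (!idx[w]) {            /* unvisited: go deeper */
                          strongconnect(w);
                          low[v] = min(low[v], low[w]);
                  } else if (on_stack[w]) { /* back-edge into stack */
                          low[v] = min(low[v], idx[w]);
                  }
          }

          if (low[v] == idx[v]) {           /* v is the SCC root */
                  printf("SCC:");
                  do {
                          w = stack[--sp];
                          on_stack[w] = 0;
                          printf(" %c", 'A' + w);
                  } while (w != v);
                  printf("\n");
          }
  }

  int main(void)
  {
          /* A->B, B->C, C->D, D->B, D->E, E->F */
          static const int e[][2] = {
                  {0, 1}, {1, 2}, {2, 3}, {3, 1}, {3, 4}, {4, 5},
          };
          int i;

          for (i = 0; i < 6; i++)
                  adj[e[i][0]][e[i][1]] = 1;
          for (i = 0; i < N; i++)
                  if (!idx[i])    /* prints: F, E, {D C B}, A */
                          strongconnect(i);
          return 0;
  }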
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new GC will use a depth first search graph algorithm to find
cyclic references. The algorithm visits every vertex exactly once.
Here, we implement the DFS part without recursion so that no one
can abuse it.
unix_walk_scc() marks every vertex unvisited by initialising index
as UNIX_VERTEX_INDEX_UNVISITED, then iterates inflight vertices in
unix_unvisited_vertices and calls __unix_walk_scc() to start DFS from
an arbitrary vertex.
__unix_walk_scc() iterates all edges starting from the vertex and
explores the neighbour vertices with DFS using edge_stack.
After visiting all neighbours, __unix_walk_scc() moves the visited
vertex to unix_visited_vertices so that unix_walk_scc() will not
restart DFS from the visited vertex.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, we track the number of inflight sockets in two variables.
unix_tot_inflight is the total number of inflight AF_UNIX sockets on
the host, and user->unix_inflight is the number of inflight fds per
user.
We update them one by one in unix_inflight(), but this could be done
in one batch. Also, sendmsg() could fail even after unix_inflight(),
in which case we need to acquire unix_gc_lock only to decrement the
counters.
Let's bulk update the counters in unix_add_edges() and
unix_del_edges(), which are called only for successfully passed fds.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As with the previous patch, we preallocate in skb's scm_fp_list an
array of struct unix_edge, sized by the number of inflight AF_UNIX
fds.
Here we just preallocate the memory and do not use it immediately,
because sendmsg() could fail after this point. The actual use will
be in the next patch.
When we queue skb with inflight edges, we will set the inflight
socket's unix_sock as unix_edge->predecessor and the receiver's
unix_sock as successor, and then we will link the edge to the
inflight socket's unix_vertex.edges.
Note that we set the cloned scm_fp_list.edges to NULL in scm_fp_dup()
so that MSG_PEEK does not change the shape of the directed graph.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will replace the garbage collection algorithm for AF_UNIX, where
we will consider each inflight AF_UNIX socket as a vertex and its file
descriptor as an edge in a directed graph.
This patch introduces a new struct unix_vertex representing a vertex
in the graph and adds its pointer to struct unix_sock.
When we send a fd using the SCM_RIGHTS message, we allocate struct
scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump
each refcount of the inflight fds' struct file and save them in
scm_fp_list.fp.
After that, unix_attach_fds() inexplicably clones scm_fp_list of
scm_cookie and sets it to skb. (We will remove this part after
replacing GC.)
Here, we add a new function call in unix_attach_fds() to preallocate
struct unix_vertex per inflight AF_UNIX fd and link each vertex to
skb's scm_fp_list.vertices.
When sendmsg() succeeds later, if the socket of the inflight fd is
still not inflight yet, we will set the preallocated vertex to struct
unix_sock.vertex and link it to a global list unix_unvisited_vertices
under spin_lock(&unix_gc_lock).
If the socket is already inflight, we free the preallocated vertex.
This is to avoid taking the lock unnecessarily when sendmsg() could
fail later.
In the following patch, we will similarly allocate another struct
per edge, which will finally be linked to the inflight socket's
unix_vertex.edges.
And then, we will count the number of edges as unix_vertex.out_degree.
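A sketch of the vertex bookkeeping as described (field set from this
patch and the next two; exact layout assumed):

  struct unix_vertex {
          struct list_head edges;     /* outgoing unix_edge list        */
          struct list_head entry;     /* unix_unvisited_vertices link   */
          unsigned long out_degree;   /* number of edges (inflight fds) */
  };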
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP ehash table is often sparsely populated.
inet_twsk_purge() spends too much time calling cond_resched().
This patch can reduce time spent in inet_twsk_purge() by 20x.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of commit 916444df30 ("ptp: deprecate gettime64() in favor of
gettimex64()"), new PTP drivers should implement gettimex64()
instead.
In addition, this variant provides timestamps from the system clock. The
readings have to be recorded right before and after reading the lowest bits
of the PHC timestamp.
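A hedged sketch of that pattern (driver name, register layout and
private structure are illustrative):

  static int foo_gettimex64(struct ptp_clock_info *ptp,
                            struct timespec64 *ts,
                            struct ptp_system_timestamp *sts)
  {
          struct foo *foo = container_of(ptp, struct foo, ptp_info);
          u32 lo, hi;

          ptp_read_system_prets(sts);        /* system clock before... */
          lo = readl(foo->base + PHC_NS_LO); /* ...latching the PHC    */
          ptp_read_system_postts(sts);       /* system clock after     */
          hi = readl(foo->base + PHC_NS_HI);

          *ts = ns_to_timespec64(((u64)hi << 32) | lo);
          return 0;
  }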
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_hash_sk and smc_unhash_sk are only used in af_smc.c, so make them
static and remove the symbol exports. They are called via the
.prot->hash()/unhash() path.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Asbjørn Sloth Tønnesen says:
====================
make skip_sw actually skip software
During development of flower-route[1], which I
recently presented at FOSDEM[2], I noticed that
CPU usage would increase the more rules I installed
into the hardware for IP forwarding offloading.
We use TC flower offload for the hottest prefixes
and leave the long tail to the normal (non-TC)
Linux network stack for slow-path IP forwarding,
so we need both the hardware and software
datapaths to perform well.
I found that skip_sw rules are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon before the kernel checks the
skip_sw flag.
This patchset optimizes the case where all rules
are skip_sw, by implementing a TC bypass for these
cases, where TC is only used as a control plane
for the hardware path.
v4:
- Rebased onto net-next, now that net-next is open again
v3: https://lore.kernel.org/netdev/20240306165813.656931-1-ast@fiberby.net/
- Patch 3:
- Fix source_inline
- Fix build failure, when CONFIG_NET_CLS without CONFIG_NET_CLS_ACT.
v2: https://lore.kernel.org/netdev/20240305144404.569632-1-ast@fiberby.net/
- Patch 1:
- Add Reviewed-By from Jiri Pirko
- Patch 2:
- Move code, to avoid forward declaration (Jiri).
- Patch 3
- Refactor to use a static key.
- Add performance data for trapping, or sending
a packet to a non-existent chain (as suggested by Marcelo).
v1: https://lore.kernel.org/netdev/20240215160458.1727237-1-ast@fiberby.net/
[1] flower-route
https://github.com/fiberby-dk/flower-route
[2] FOSDEM talk
https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TC filters come in 3 variants:
- no flag (try to process in hardware, but fall back to software)
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)
However, skip_sw is implemented so that the flag is only
checked after the filter has already been matched.
IMHO it's common, when using skip_sw, to use it on all rules.
So if all filters in a block are skip_sw filters, then
we can bail early and avoid having to match the filters
just to check for the skip_sw flag.
This patch adds a bypass for the case where only TC skip_sw
rules are used. The bypass is guarded by a static key, to avoid
harming other workloads.
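Conceptually, the bypass in the classification path looks like this
(key and helper names assumed from the series description):

  /* tcf_classify(): bail out before matching anything when every
   * filter in the block is skip_sw. */
  if (static_branch_unlikely(&tcf_bypass_check_needed_key)) {
          if (tcf_block_bypass_sw(block))
                  return TC_ACT_UNSPEC;   /* continue in sw datapath */
  }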
There are 3 ways that a packet from a skip_sw ruleset can
end up in the kernel path. Although the send-packets-to-a-
non-existent-chain way is only improved by a few percent, I
believe it's worth optimizing the trap and fall-through
use-cases.
+----------------------------+--------+--------+--------+
| Test description | Pre- | Post- | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x |
| switch to eswitch mode | 3081.8 | 3094.7 | 1.00x |
| add ingress qdisc | 3042.9 | 3063.6 | 1.01x |
| tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x |
| tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x |
+----------------------------+--------+--------+--------+
| tests with only skip_sw rules below: |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 2694.7 | 3058.7 | 1.14x |
| 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x |
| 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x |
| 5 non-matching rules | 1958.2 | 3061.3 | 1.56x |
| 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x |
| 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x |
| 10 non-matching rules | 1466.3 | 3062.8 | 2.09x |
| 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x |
| 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x |
| 25 non-matching rules | 838.5 | 3058.9 | 3.65x |
| 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x |
| 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x |
| 50 non-matching rules | 488.1 | 3054.7 | 6.26x |
| 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x |
| 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x |
+----------------------------+--------+--------+--------+
perf top (25 n-m skip_sw rules - pre patch):
20.39% [kernel] [k] __skb_flow_dissect
16.43% [kernel] [k] rhashtable_jhash2
10.58% [kernel] [k] fl_classify
10.23% [kernel] [k] fl_mask_lookup
4.79% [kernel] [k] memset_orig
2.58% [kernel] [k] tcf_classify
1.47% [kernel] [k] __x86_indirect_thunk_rax
1.42% [kernel] [k] __dev_queue_xmit
1.36% [kernel] [k] nft_do_chain
1.21% [kernel] [k] __rcu_read_lock
perf top (25 n-m skip_sw rules - post patch):
5.12% [kernel] [k] __dev_queue_xmit
4.77% [kernel] [k] nft_do_chain
3.65% [kernel] [k] dev_gro_receive
3.41% [kernel] [k] check_preemption_disabled
3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
2.88% [kernel] [k] __netif_receive_skb_core.constprop.0
2.49% [kernel] [k] mlx5e_xmit
2.15% [kernel] [k] ip_forward
1.95% [kernel] [k] mlx5e_tc_restore_tunnel
1.92% [kernel] [k] vlan_gro_receive
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a count of filters per block.
Counter updates are protected by cb_lock, which is
also used to protect the offload counters.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a count of skip_sw filters.
This counter is protected by the cb_lock, and is updated
at the same time as offloadcnt.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
ice: use less resources in switchdev
Michal Swiatkowski says:
Switchdev uses one queue per created port representor. This can
quickly lead to an Rx queue shortage, as with subfunction support the
user can create a high number of PRs.
This saves one MSI-X vector and one queue per PR.
Refactor the switchdev slow-path to use fewer resources (even no
additional resources). Do this by removing the control plane VSI and
moving its functionality to the PF VSI. Even with the current
solution, the PF acts as the uplink and can't be used simultaneously
for other use cases (adding filters can break the slow-path).
In short, do Tx via the PF VSI and Rx packets using PF resources. Rx
needs additional code in the interrupt handler to choose the correct
PR netdev. The previous solution had to queue filters; it was more
elegant but needed one queue per PR. Besides that, this refactor
mostly simplifies switchdev configuration.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: count representor stats
ice: do switchdev slow-path Rx using PF VSI
ice: change repr::id values
ice: remove switchdev control plane VSI
ice: control default Tx rule in lag
ice: default Tx rule instead of to queue
ice: do Tx through PF netdev in slow-path
ice: remove eswitch changing queues algorithm
====================
Link: https://lore.kernel.org/r/20240325202623.1012287-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>