Ido Schimmel says:
====================
Preparations for FIB rule DSCP selector
This patchset moves the masking of the upper DSCP bits in 'flowi4_tos'
to the core instead of relying on callers of the FIB lookup API to do
it.
This will allow us to start changing users of the API to initialize the
'flowi4_tos' field with all six bits of the DSCP field. In turn, this
will allow us to extend FIB rules with a new DSCP selector.
By masking the upper DSCP bits in the core we are able to maintain the
behavior of the TOS selector in FIB rules and routes to only match on
the lower DSCP bits.
While working on this I found two users of the API that do not mask the
upper DSCP bits before performing the lookup. The first is an ancient
netlink family that is unlikely to be used. It is adjusted in patch #1
to mask both the upper DSCP bits and the ECN bits before calling the
API.
The second user is a nftables module that differs in this regard from
its equivalent iptables module. It is adjusted in patch #2 to invoke the
API with the upper DSCP bits masked, like all other callers. The
relevant selftest passed, but in the unlikely case that regressions are
reported because of this change, we can restore the existing behavior
using a new flow information flag as discussed here [1].
The last patch moves the masking of the upper DSCP bits to the core,
making the first two patches redundant, but I wanted to post them
separately to call attention to the behavior change for these two users
of the FIB lookup API.
Future patchsets (around 3) will start unmasking the upper DSCP bits
throughout the networking stack before adding support for the new FIB
rule DSCP selector.
Changes from v1 [2]:
Patch #3: Include <linux/ip.h> in <linux/in_route.h> instead of
including it in net/ip_fib.h
[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
[2] https://lore.kernel.org/netdev/20240725131729.1729103-1-idosch@nvidia.com/
====================
Link: https://patch.msgid.link/20240814125224.972815-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The TOS field in the IPv4 flow information structure ('flowi4_tos') is
matched by the kernel against the TOS selector in IPv4 rules and routes.
The field is initialized differently by different call sites. Some treat
it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as
RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC
791 TOS and initialize it using IPTOS_RT_MASK.
What is common to all these call sites is that they all initialize the
lower three DSCP bits, which fits the TOS definition in the initial IPv4
specification (RFC 791).
Therefore, the kernel only allows configuring IPv4 FIB rules that match
on the lower three DSCP bits which are always guaranteed to be
initialized by all call sites:
# ip -4 rule add tos 0x1c table 100
# ip -4 rule add tos 0x3c table 100
Error: Invalid tos.
While this works, it is unlikely to be very useful. RFC 791, which
initially defined the TOS and IP precedence fields, was updated by RFC
2474 more than twenty-five years ago, when these fields were replaced by
a single six-bit DSCP field.
Extending FIB rules to match on DSCP can be done by adding a new DSCP
selector while maintaining the existing semantics of the TOS selector
for applications that rely on that.
A prerequisite for allowing FIB rules to match on DSCP is to adjust all
the call sites to initialize the high order DSCP bits and remove their
masking along the path to the core where the field is matched on.
However, making this change alone will result in a behavior change. For
example, a forwarded IPv4 packet with a DS field of 0xfc will no longer
match a FIB rule that was configured with 'tos 0x1c'.
This behavior change can be avoided by masking the upper three DSCP bits
in 'flowi4_tos' before comparing it against the TOS selectors in FIB
rules and routes.
Implement the above by adding a new function that checks whether a given
DSCP value matches the one specified in the IPv4 flow information
structure and invoke it from the three places that currently match on
'flowi4_tos'.
Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK
since the latter is not uAPI and we should be able to remove it at some
point.
Include <linux/ip.h> in <linux/in_route.h> since the former defines
IPTOS_TOS_MASK which is used in the definition of RT_TOS() in
<linux/in_route.h>.
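The masked comparison described above can be sketched in user-space C
roughly as follows. This is only an illustration under stated
assumptions, not the kernel code: every sketch_ name is hypothetical,
and the two mask values are copied by hand (0x1E for IPTOS_TOS_MASK,
0xFC for the six DSCP bits of the DS field).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 0x1E mirrors IPTOS_TOS_MASK, 0xFC covers the six DSCP bits. */
#define SKETCH_TOS_MASK  0x1Eu
#define SKETCH_DSCP_MASK 0xFCu

/* Stand-in for RT_TOS(): keep only the RFC 1349 TOS bits. */
static inline uint8_t sketch_rt_tos(uint8_t tos)
{
	return tos & SKETCH_TOS_MASK;
}

/*
 * Hypothetical helper: compare a rule/route selector against the flow's
 * TOS after masking the upper three DSCP bits and the ECN bits.
 */
static inline bool sketch_dscp_masked_match(uint8_t selector, uint8_t flow_tos)
{
	return selector == (sketch_rt_tos(flow_tos) & SKETCH_DSCP_MASK);
}
```

With these definitions, a flow whose DS field is 0xfc still matches a
selector of 0x1c, which is the behavior this patch sets out to preserve.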
No regressions in FIB tests:
# ./fib_tests.sh
[...]
Tests passed: 218
Tests failed: 0
And FIB rule tests:
# ./fib_rule_tests.sh
[...]
Tests passed: 116
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As part of its functionality, the nftables FIB expression module
performs a FIB lookup, but unlike other users of the FIB lookup API, it
does so without masking the upper DSCP bits. In particular, this differs
from the equivalent iptables match ("rpfilter") that does mask the upper
DSCP bits before the FIB lookup.
Align the module to other users of the FIB lookup API and mask the upper
DSCP bits using IPTOS_RT_MASK before the lookup.
No regressions in nft_fib.sh:
# ./nft_fib.sh
PASS: fib expression did not cause unwanted packet drops
PASS: fib expression did drop packets for 1.1.1.1
PASS: fib expression did drop packets for 1c3::c01d
PASS: fib expression forward check with policy based routing
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB
lookup according to user provided parameters and communicate the result
back to user space.
However, unlike other users of the FIB lookup API, the upper DSCP bits
and the ECN bits of the DS field are not masked, which can result in the
wrong result being returned.
Solve this by masking the upper DSCP bits and the ECN bits using
IPTOS_RT_MASK.
The structure that communicates the request and the response is not
exported to user space, so it is unlikely that this netlink family is
actually in use [1].
[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
GRO code checks for matching layer 2 headers to see if a packet belongs
to the same flow, and because ip6 tunnel sets dev->hard_header_len,
this check fails in cases where it shouldn't. To fix this, don't
set hard_header_len, but use needed_headroom like ipv4/ip_tunnel.c
does.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Link: https://patch.msgid.link/20240815151419.109864-1-tbogendoerfer@suse.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When only the last resource is invalid, tpmi_sst_dev_add() is returning
an error even if there are other valid resources before it. This function
should return an error only when there are no valid resources.
Here tpmi_sst_dev_add() is returning "ret" variable. But this "ret"
variable contains the failure status of last call to sst_main(), which
failed for the invalid resource. But there may be other valid resources
before the last entry.
To address this, do not update "ret" variable for sst_main() return
status.
If there are no valid resources, it is already checked for by !inst
below the loop and -ENODEV is returned.
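The intended control flow can be sketched in plain C as below. This is a
simplified stand-in, not the driver code: sketch_init_one() and
sketch_dev_add() are hypothetical names, and an array of flags replaces
the real per-resource setup.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-resource init: fails for resources flagged invalid. */
static int sketch_init_one(int valid)
{
	return valid ? 0 : -EINVAL;
}

/* Returns 0 if at least one resource was set up, -ENODEV otherwise. */
static int sketch_dev_add(const int *valid, size_t count)
{
	size_t i;
	int inst = 0;

	for (i = 0; i < count; i++) {
		/* A failing resource is skipped; its error is not kept. */
		if (sketch_init_one(valid[i]))
			continue;
		inst++;
	}

	/* Only the "no valid resource at all" case is an error. */
	if (!inst)
		return -ENODEV;

	return 0;
}
```

Because the per-resource status never overwrites the function's return
value, an invalid last entry no longer masks earlier successes.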
Fixes: 9d1d36268f ("platform/x86: ISST: Support partitioned systems")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: stable@vger.kernel.org # 6.10+
Link: https://lore.kernel.org/r/20240816163626.415762-1-srinivas.pandruvada@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Revert commit 4c905f6740 ("netfilter: nf_tables: initialize registers in
nft_do_chain()").
The previous patch makes sure that loads from uninitialized registers are
detected from the control plane. In this case the rule blob auto-zeroes
registers. Thus the explicit zeroing is not needed anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject rules where a load occurs from a register that has not seen a store
early in the same rule.
commit 4c905f6740 ("netfilter: nf_tables: initialize registers in
nft_do_chain()")
had to add an unconditional memset to the nftables register space to avoid
leaking stack information to userspace.
This memset shows up in benchmarks. After this change, this commit can
be reverted again.
Note that this breaks userspace compatibility, because theoretically
you can do
rule 1: reg2 := meta load iif, reg2 == 1 jump ...
rule 2: reg2 == 2 jump ... // read access with no store in this rule
... after this change this is rejected.
Neither nftables nor iptables-nft generate such rules, each rule is
always standalone.
This results in a small increase of the nft_ctx structure by sizeof(long).
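One way to picture the check is a per-rule bitmask of registers that
have seen a store, as in the user-space sketch below. The names and the
register count are assumptions made for illustration; the sketch does
not claim to match the actual nf_tables data structures.

```c
#include <stdbool.h>

/* Assumed register count for the sketch; must fit in an unsigned long. */
#define SKETCH_NUM_REGS 20u

struct sketch_rule_ctx {
	unsigned long reg_inited;	/* bit N set: register N was stored to */
};

/* Each rule starts with no initialized registers. */
static void sketch_rule_begin(struct sketch_rule_ctx *ctx)
{
	ctx->reg_inited = 0;
}

/* Record a store; returns false for an out-of-range register. */
static bool sketch_store(struct sketch_rule_ctx *ctx, unsigned int reg)
{
	if (reg >= SKETCH_NUM_REGS)
		return false;
	ctx->reg_inited |= 1UL << reg;
	return true;
}

/* A load is accepted only if the same rule stored to the register first. */
static bool sketch_load_ok(const struct sketch_rule_ctx *ctx, unsigned int reg)
{
	if (reg >= SKETCH_NUM_REGS)
		return false;
	return (ctx->reg_inited & (1UL << reg)) != 0;
}
```

In the two-rule example above, rule 2 would start with an empty mask, so
its load from reg2 would be refused.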
To cope with hypothetical rulesets like the example above one could emit
on-demand "reg[x] = 0" store when generating the datapath blob in
nf_tables_commit_chain_prepare().
A patch that does this is linked to below.
For now, let's disable this. In nf_tables, a rule is the smallest
unit that can be replaced from userspace, i.e. a hypothetical ruleset
that relies on earlier initialisations of registers can't be changed
at will as register usage would need to be coordinated.
Link: https://lore.kernel.org/netfilter-devel/20240627135330.17039-4-fw@strlen.de/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_counter_reset() resets the counter by subtracting the previously
retrieved value from the counter. This is a write operation on the
counter and as such it must be performed within a write sequence of
nft_counter_seq to serialize against its possible reader.
Update the packets/bytes within the write sequence of nft_counter_seq.
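The idea of a write sequence can be illustrated with a small user-space
sequence counter built on C11 atomics. This is a sketch under
assumptions: it uses sequentially consistent atomics for simplicity,
assumes a single writer at a time, and all sketch_ names are
hypothetical rather than the kernel's seqcount_t API.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sketch_counter {
	atomic_uint seq;		/* even: stable, odd: write in progress */
	_Atomic uint64_t packets;
	_Atomic uint64_t bytes;
};

static void sketch_write_begin(struct sketch_counter *c)
{
	atomic_fetch_add(&c->seq, 1);	/* sequence becomes odd */
}

static void sketch_write_end(struct sketch_counter *c)
{
	atomic_fetch_add(&c->seq, 1);	/* sequence becomes even again */
}

/* Reset by subtracting previously fetched values inside a write section. */
static void sketch_reset(struct sketch_counter *c, uint64_t packets,
			 uint64_t bytes)
{
	sketch_write_begin(c);
	atomic_fetch_sub(&c->packets, packets);
	atomic_fetch_sub(&c->bytes, bytes);
	sketch_write_end(c);
}

/* Reader retries until it observes an even, unchanged sequence. */
static void sketch_fetch(struct sketch_counter *c, uint64_t *packets,
			 uint64_t *bytes)
{
	unsigned int start;

	do {
		start = atomic_load(&c->seq);
		*packets = atomic_load(&c->packets);
		*bytes = atomic_load(&c->bytes);
	} while ((start & 1u) || start != atomic_load(&c->seq));
}
```

A reader that overlaps with sketch_reset() sees either an odd or a
changed sequence and retries, so it never reports a packets value from
before the reset together with a bytes value from after it.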
Fixes: d84701ecbc ("netfilter: nft_counter: rework atomic dump and reset")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The sequence counter nft_counter_seq is a per-CPU counter. There is no
lock associated with it. nft_counter_do_eval() is using the same counter
and disables BH, which suggests that it can be invoked from a softirq.
This in turn means that nft_counter_offload_stats(), which disables only
preemption, can be interrupted by nft_counter_do_eval(), leading to two
writers for one seqcount_t.
This can lead to losing stats or reading statistics while they are
updated.
Disable BH during stats update in nft_counter_offload_stats() to ensure
one writer at a time.
Fixes: b72920f6e4 ("netfilter: nftables: counter hardware offload support")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Wen Gu says:
====================
net/smc: introduce ringbufs usage statistics
Currently, we have histograms that show the sizes of ringbufs that were
ever used by SMC connections. However, they are always incremental and
since SMC allows the reuse of ringbufs, we cannot know the actual amount
of ringbufs being allocated or actively used.
So this patch set introduces statistics for the amount of ringbufs that
are actually allocated by a link group and actively used by connections
of a certain net namespace, so that we can react based on this memory
usage information, e.g. active fallback to TCP.
With appropriate adaptations of smc-tools, we can obtain this ringbuf
usage information:
$ smcr -d linkgroup
LG-ID : 00000500
LG-Role : SERV
LG-Type : ASYML
VLAN : 0
PNET-ID :
Version : 1
Conns : 0
Sndbuf : 12910592 B <-
RMB : 12910592 B <-
or
$ smcr -d stats
[...]
RX Stats
Data transmitted (Bytes) 869225943 (869.2M)
Total requests 18494479
Buffer usage (Bytes) 12910592 (12.31M) <-
[...]
TX Stats
Data transmitted (Bytes) 12760884405 (12.76G)
Total requests 36988338
Buffer usage (Bytes) 12910592 (12.31M) <-
[...]
[...]
Change log:
v3->v2
- use new helper nla_put_uint() instead of nla_put_u64_64bit().
v2->v1
https://lore.kernel.org/r/20240807075939.57882-1-guwen@linux.alibaba.com/
- remove inline keyword in .c files.
- use local variable in macros to avoid potential side effects.
v1
https://lore.kernel.org/r/20240805090551.80786-1-guwen@linux.alibaba.com/
====================
Link: https://patch.msgid.link/20240814130827.73321-1-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record
the sizes of ringbufs for all connections that have ever appeared in
the net namespace. They are incremental and we cannot know the actual
ringbuf usage from these. So introduce statistics for the current
ringbuf usage of existing SMC connections in the net namespace into
smc_stats. They are incremented when a new connection uses a ringbuf
and decremented when the ringbuf becomes unused.
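The difference between the incremental histogram and the new usage
statistic can be sketched as follows. The structure and function names
are hypothetical and the bucket count is arbitrary; the sketch only
shows why a histogram alone cannot answer "how much is in use now".

```c
#include <stdint.h>

#define SKETCH_BUCKETS 4u	/* arbitrary number of size buckets */

struct sketch_stats {
	uint64_t size_hist[SKETCH_BUCKETS];	/* only ever incremented */
	uint64_t usage_bytes;			/* current usage, goes up and down */
};

/* A connection starts using a ringbuf of the given size. */
static void sketch_buf_used(struct sketch_stats *s, unsigned int bucket,
			    uint64_t bytes)
{
	s->size_hist[bucket % SKETCH_BUCKETS]++;
	s->usage_bytes += bytes;
}

/* The ringbuf is released for reuse; the histogram is left untouched. */
static void sketch_buf_unused(struct sketch_stats *s, uint64_t bytes)
{
	s->usage_bytes -= bytes;
}
```

After a buffer is used, released and reused, the histogram bucket reads
2 while the usage counter reflects a single buffer.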
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently we have the statistics on sndbuf/RMB sizes of all connections
that have ever been on the link group, namely smc_stats_memsize. However,
these statistics are incremental and since the ringbufs of a link group
are allowed to be reused, we cannot know the actual allocated buffers
through these. So introduce a statistic on the actually allocated
ringbufs of the link group. It is incremented when a new ringbuf is
added into buf_list and decremented when it is deleted from buf_list.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These have never been implemented since commit b691b1116e ("net/mlx5: Implement
devlink port function cmds to control ipsec_packet").
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816101550.881844-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit f13697cc7a ("gve: Switch to config-aware queue allocation")
converted this function to gve_rx_alloc_rings_gqi().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816101906.882743-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the netlink policy to validate IPv6 address length.
The destination address currently has a policy with max len set,
and the source address has no policy validation. In both cases
the code does the real check. With a correct policy
check, that code can be removed.
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240816212245.467745-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Luczaj says:
====================
Series takes care of a few bugs and missing features with the aim to improve
the test coverage of sockmap/sockhash.
Last patch is a create_pair() rewrite making use of
__attribute__((cleanup)) to handle socket fd lifetime.
---
Changes in v2:
- Rebase on bpf-next (Jakub)
- Use cleanup helpers from kernel's cleanup.h (Jakub)
- Fix subject of patch 3, rephrase patch 4, use correct prefix
- Link to v1: https://lore.kernel.org/r/20240724-sockmap-selftest-fixes-v1-0-46165d224712@rbox.co
Changes in v1:
- No declarations in function body (Jakub)
- Don't touch output arguments until function succeeds (Jakub)
- Link to v0: https://lore.kernel.org/netdev/027fdb41-ee11-4be0-a493-22f28a1abd7c@rbox.co/
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Rewrite function to have (unneeded) socket descriptors automatically
close()d when leaving the scope. Make sure the "ownership" of fds is
correctly passed via take_fd(); i.e. descriptor returned to caller will
remain valid.
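The scope-based pattern can be sketched in stand-alone C as below. It
relies on the GCC/Clang cleanup attribute; the sketch_ helpers are
hypothetical stand-ins written for this illustration and are not the
selftest's or cleanup.h's actual definitions.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Called automatically when an annotated variable leaves its scope. */
static void sketch_close_fd(int *fd)
{
	if (*fd >= 0)
		close(*fd);
}

#define sketch_autoclose __attribute__((cleanup(sketch_close_fd)))

/* Hand ownership to the caller: the local no longer refers to the fd. */
static int sketch_take_fd(int *fd)
{
	int ret = *fd;

	*fd = -1;
	return ret;
}

/* Create a connected AF_UNIX pair; outputs are written only on success. */
static int sketch_create_pair(int sotype, int *p0, int *p1)
{
	int sv[2] = { -1, -1 };
	int type = -1;
	socklen_t len = sizeof(type);

	if (socketpair(AF_UNIX, sotype, 0, sv))
		return -1;

	int s0 sketch_autoclose = sv[0];
	int s1 sketch_autoclose = sv[1];

	/* On this early return both descriptors are closed automatically. */
	if (getsockopt(s0, SOL_SOCKET, SO_TYPE, &type, &len) || type != sotype)
		return -1;

	*p0 = sketch_take_fd(&s0);
	*p1 = sketch_take_fd(&s1);
	return 0;
}
```

Since sketch_take_fd() leaves -1 behind, the cleanup handler skips the
descriptors that were handed to the caller.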
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-6-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Constants got switched reducing the test's coverage. Replace SOCK_DGRAM
with SOCK_STREAM in one of unix_inet_skb_redir_to_connected() tests.
Fixes: 51354f700d ("bpf, sockmap: Add af_unix test with both sockets in map")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-5-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Do actually test the sotype as specified by the caller.
This picks up after commit 75e0e27db6 ("selftest/bpf: Change udp to inet
in some function names").
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-4-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Replace implementation with a call to a generic function.
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-3-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Following create_pair() changes, remove unused function argument in
create_socket_pairs() and adapt its callers, i.e. drop the open-coded
loopback socket creation.
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-2-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Extend the function to allow creating socket pairs of SOCK_STREAM,
SOCK_DGRAM and SOCK_SEQPACKET.
Adapt direct callers and leave further cleanups for the following patch.
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20240731-selftest-sockmap-fixes-v2-1-08a0c73abed2@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Merge tag 'hid-for-linus-2024081901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- memory corruption fixes for hid-cougar (Camila Alvarez) and
hid-amd_sfh (Olivier Sobrie)
- fix for regression in Wacom driver of twist gesture handling (Jason
Gerecke)
- two new device IDs for hid-multitouch (Dmitry Savin) and hid-asus
(Luke D. Jones)
* tag 'hid-for-linus-2024081901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: wacom: Defer calculation of resolution until resolution_code is known
HID: multitouch: Add support for GT7868Q
HID: amd_sfh: free driver_data after destroying hid device
hid-asus: add ROG Ally X prod ID to quirk list
HID: cougar: fix slab-out-of-bounds Read in cougar_report_fixup
This patch is to move nf_ct_netns_get() out of nf_conncount_init()
and let the consumers of nf_conncount decide if they want to turn
on netfilter conntrack.
It makes nf_conncount more flexible to be used in other places and
avoids netfilter conntrack turned on when using it in openvswitch
conntrack.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pipapo set backend maintains two copies of the data structure.
Removing the elements from the copy that is going to be discarded slows
down the abort path significantly; with this patch, the abort path goes
from several minutes to a few seconds.
This patch was previously reverted by
f86fb94011 ("netfilter: nf_tables: revert do not remove elements if set backend implements .abort")
but it is now possible since recent work by Florian Westphal to perform
on-demand clone from insert/remove path:
532aec7e87 ("netfilter: nft_set_pipapo: remove dirty flag")
3f1d886cc7 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path")
a238106703 ("netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone")
c5444786d0 ("netfilter: nft_set_pipapo: merge deactivate helper into caller")
6c108d9bee ("netfilter: nft_set_pipapo: prepare walk function for on-demand clone")
8b8a241755 ("netfilter: nft_set_pipapo: prepare destroy function for on-demand clone")
80efd2997f ("netfilter: nft_set_pipapo: make pipapo_clone helper return NULL")
a590f47609 ("netfilter: nft_set_pipapo: move prove_locking helper around")
after this series, the clone is fully released once aborted, no need to
take it back to previous state. Thus, no stale reference to elements can
occur.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_set_lookup_byid() is very slow when transaction becomes large, due to
walk of the transaction list.
Add a dedicated list that contains only the new sets.
Before: nft -f ruleset 0.07s user 0.00s system 0% cpu 1:04.84 total
After: nft -f ruleset 0.07s user 0.00s system 0% cpu 30.115 total
.. where ruleset contains ~10 sets with ~100k elements.
The above number is for a combined flush+reload of the ruleset.
With previous flush, even the first NEWELEM has to walk through a few
hundred thousands of DELSET(ELEM) transactions before the first NEWSET
object. To cope with random-order-newset-newsetelem we'd need to replace
commit_set_list with a hashtable.
Expectation is that a NEWELEM operation refers to the most recently added
set, so last entry of the dedicated list should be the set we want.
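The lookup strategy can be pictured with a minimal list that holds only
the new sets and is searched from the most recent entry backwards. The
types and names below are hypothetical and much simpler than the
nf_tables transaction code.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_set {
	uint32_t id;
	struct sketch_set *prev;	/* previously added new set, or NULL */
};

struct sketch_txn {
	struct sketch_set *last_new_set;	/* most recently added set */
};

/* Track a newly created set on the dedicated list. */
static void sketch_txn_add_set(struct sketch_txn *t, struct sketch_set *s)
{
	s->prev = t->last_new_set;
	t->last_new_set = s;
}

/* Search newest-first: the expected hit is the first entry visited. */
static struct sketch_set *sketch_lookup_byid(const struct sketch_txn *t,
					     uint32_t id)
{
	struct sketch_set *s;

	for (s = t->last_new_set; s; s = s->prev)
		if (s->id == id)
			return s;
	return NULL;
}
```

Unrelated transaction objects never appear on this list, so the cost of
a lookup depends on the number of new sets rather than on the size of
the whole transaction.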
NB: This is not a bug fix per se (functionality is fine), but with
larger transaction batches list search takes forever, so it would be
nice to speed this up for -stable too, hence adding a "fixes" tag.
Fixes: 958bee14d0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use consume_skb in the batch code path to avoid generating spurious
NOT_SPECIFIED skb drop reasons.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Test that nfqueue with and without GSO process SCTP packets correctly.
Joint work with Florian and Pablo.
Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a packet is enqueued with nfqueue and GSO is enabled, checksum
calculation has to take into account the protocol, as SCTP uses a
32-bit CRC checksum.
Enter the skb_gso_segment() path in case of SCTP GSO packets because
skb_zerocopy() does not support GSO_BY_FRAGS.
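For reference, the 32-bit CRC that SCTP uses is CRC32c (Castagnoli),
which differs from the 16-bit ones' complement checksum used by TCP and
UDP. A plain bitwise sketch is shown below; it is a slow, illustrative
implementation with a hypothetical name, and it leaves out how SCTP
places the result in the packet header.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c: reflected polynomial 0x82F63B78, init and final XOR ~0. */
static uint32_t sketch_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			/* XOR in the polynomial only when the low bit is set. */
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return ~crc;
}
```

The standard check value applies: the nine ASCII bytes "123456789"
produce 0xE3069283.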
Joint work with Pablo.
Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
- Do not block printk on non-panic CPUs when they are dumping
backtraces
* tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
It's possible that two threads call tcp_sk_exit_batch() concurrently,
once from the cleanup_net workqueue, once from a task that failed to clone
a new netns. In the latter case, error unwinding calls the exit handlers
in reverse order for the 'failed' netns.
tcp_sk_exit_batch() calls tcp_twsk_purge().
Problem is that since commit b099ce2602 ("net: Batch inet_twsk_purge"),
this function picks up twsk in any dying netns, not just the one passed
in via exit_batch list.
This means that the error unwind of setup_net() can "steal" and destroy
timewait sockets belonging to the exiting netns.
This allows the netns exit worker to proceed to call
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
without the expected 1 -> 0 transition, which then splats.
At the same time, the error unwind path that is also running
inet_twsk_purge() will splat as well:
WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
...
refcount_dec include/linux/refcount.h:351 [inline]
inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
setup_net+0x714/0xb40 net/core/net_namespace.c:375
copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
This doesn't seem like an actual bug (no tw sockets got lost and I don't
see a use-after-free) but rather an erroneous trigger of a debug check.
Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
blocks the other task from doing the final _dec_and_test before the
mutex owner has removed all tw sockets of the dying netns.
Fixes: e9bd0cca09 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240812222857.29837-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sparse tool complains as follows:
drivers/iommu/iommufd/selftest.c:277:30: warning:
symbol 'dirty_ops' was not declared. Should it be static?
This symbol is not used outside of selftest.c, so mark it static.
Fixes: 266ce58989 ("iommufd/selftest: Test IOMMU_HWPT_ALLOC_DIRTY_TRACKING")
Link: https://patch.msgid.link/r/20240819120007.3884868-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Hangbin Liu says:
====================
selftests: Fix udpgro failures
There are 2 issues with the current udpgro test. The first one is that the
test doesn't record all the failures, so it may report a pass even though
the test actually failed, e.g.
https://netdev-3.bots.linux.dev/vmksft-net/results/725661/45-udpgro-sh/stdout
The other one is after commit d7db7775ea ("net: veth: do not manipulate
GRO when using XDP"), there is no need to load xdp program to enable GRO
on veth device.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit d7db7775ea ("net: veth: do not manipulate GRO when using
XDP"), there is no need to load an XDP program to enable GRO. On the other
hand, the current test fails due to loading the XDP program, e.g.
# selftests: net: udpgro.sh
# ipv4
# no GRO ok
# no GRO chk cmsg ok
# GRO ./udpgso_bench_rx: recv: bad packet len, got 1472, expected 14720
#
# failed
[...]
# bad GRO lookup ok
# multiple GRO socks ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
#
# ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
#
# failed
ok 1 selftests: net: udpgro.sh
After the fix, all the tests pass.
# ./udpgro.sh
ipv4
no GRO ok
[...]
multiple GRO socks ok
Fixes: d7db7775ea ("net: veth: do not manipulate GRO when using XDP")
Reported-by: Yi Chen <yiche@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-53858
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we only check the last sender's exit code. If the receiver
reports a failure, it is not recorded. Fix it by checking the exit code
of all the involved processes.
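The selftest itself is a shell script, but the principle of collecting
every process's exit status can be sketched in C with fork() and
waitpid(). The helper names are hypothetical and the child processes
merely exit with a chosen code in place of real senders and receivers.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn a child that exits immediately with the given code. */
static pid_t sketch_spawn(int exit_code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(exit_code);
	return pid;
}

/* Returns 0 only if every process exited normally with status 0. */
static int sketch_wait_all(const pid_t *pids, int count)
{
	int i, status, ret = 0;

	for (i = 0; i < count; i++) {
		if (pids[i] < 0 ||
		    waitpid(pids[i], &status, 0) < 0 ||
		    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			ret = 1;	/* remember the failure, keep waiting */
	}
	return ret;
}
```

If the "receiver" fails and the last "sender" succeeds, the aggregated
result is still a failure, which is the case the old check missed.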
Before:
bad GRO lookup ok
multiple GRO socks ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
failed
$ echo $?
0
After:
bad GRO lookup ok
multiple GRO socks ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
failed
$ echo $?
1
Fixes: 3327a9c463 ("selftests: add functionals test for UDP GRO")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fsl,enetc-ptp is an embedded PCIe device. Add compatible string pci1957,ee02.
Fix warning:
arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dtb: ethernet@0,4:
compatible:0: 'pci1957,ee02' is not one of ['fsl,etsec-ptp', 'fsl,fman-ptp-timer', 'fsl,dpaa2-ptp', 'fsl,enetc-ptp']
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit 2e6506e1c4 ("mm/migrate: fix deadlock in
migrate_pages_batch() on large folios") has landed upstream, large
folios can be safely enabled for compressed inodes since all
prerequisites have already landed in 6.11-rc1.
Stress tests have been running on my fleet for over 20 days without any
regression. Additionally, users [1] have requested it for months.
Let's allow large folios for EROFS full cases upstream now for wider
testing.
[1] https://lore.kernel.org/r/CAGsJ_4wtE8OcpinuqVwG4jtdx6Qh5f+TON6wz+4HMCq=A2qFcA@mail.gmail.com
Cc: Barry Song <21cnbao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
[ Gao Xiang: minor commit typo fixes. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240819025207.3808649-1-hsiangkao@linux.alibaba.com
- Use i_size instead of i_size_read() due to immutable fses;
- Get rid of an unneeded goto since erofs_fill_dentries() also works;
- Remove unnecessary lines.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Replace the deprecated one-element arrays with flexible-array members
in the structs filesystem_attribute_info and filesystem_device_info.
There are no binary differences after this conversion.
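The general shape of such a conversion is sketched below with made-up
structures; they are not the ksmbd definitions. The trailing member
keeps the same offset in both forms, while sizeof() of the new form no
longer counts a phantom first element.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Deprecated style: one-element array used as a variable-length tail. */
struct sketch_info_old {
	uint32_t name_len;
	char name[1];
};

/* Preferred style: C99 flexible-array member. */
struct sketch_info_new {
	uint32_t name_len;
	char name[];
};

/* Allocate the header plus exactly as many tail bytes as needed. */
static struct sketch_info_new *sketch_info_alloc(const char *name)
{
	size_t len = strlen(name);
	struct sketch_info_new *info = malloc(sizeof(*info) + len);

	if (!info)
		return NULL;
	info->name_len = (uint32_t)len;
	memcpy(info->name, name, len);	/* stored without a terminating NUL */
	return info;
}
```

Code that computed sizes as sizeof(struct) - 1 + len with the old layout
would use sizeof(struct) + len with the new one.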
Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are a couple of spelling mistakes in the documentation. This patch
fixes them.
Signed-off-by: Victor Timofei <victor@vtimothy.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If there is a ->PreviousSessionId field in the session setup request,
the session of the previous connection should be destroyed.
During this, if the smb2 operation requests in the previous session are
being processed, a race could happen with ksmbd_destroy_file_table().
This patch sets conn->status to KSMBD_SESS_NEED_RECONNECT to block
incoming operations and waits until on-going operations are complete
(i.e. idle) before destroying the previous session.
Fixes: c8efcc7861 ("ksmbd: add support for durable handles v1/v2")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25040
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
rsp buffer is allocated larger than spnego_blob from
smb2_allocate_rsp_buf().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>