This patch adds SMCv2.x supplemental features negotiation. Supported
SMCv2.x supplemental features are represented by feature_mask in FCE
field. The negotiation process is as follows.
Server Client
Proposal(features(c-mask bits))
<-----------------------------------------
Accept(features(s-mask bits))
----------------------------------------->
Confirm(features(s&c-mask bits))
<-----------------------------------------
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-and-tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The structs of CLC accept and confirm messages for SMCv1 and SMCv2 are
separately defined and often casted to each other in the code, which may
increase the risk of errors caused by future divergence of them. So
unify them into one struct for better maintainability.
Suggested-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a large if-else block in smc_clc_send_confirm_accept() and it
is better to split it into two sub-functions.
Suggested-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename some functions or variables with 'fce' in their name but used in
SMCv2.1 as 'fce_v2x' for clarity.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we know we cannot send the datagram (state can be set to LLCP_CLOSED
by nfc_llcp_socket_release()), there is no need to proceed further.
Thus, bail out early from llcp_sock_sendmsg().
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llcp_sock_sendmsg() calls nfc_llcp_send_ui_frame() which in turn calls
nfc_alloc_send_skb(), which accesses the nfc_dev from the llcp_sock for
getting the headroom and tailroom needed for skb allocation.
Parallelly the nfc_dev can be freed, as the refcount is decreased via
nfc_free_device(), leading to a UAF reported by Syzkaller, which can
be summarized as follows:
(1) llcp_sock_sendmsg() -> nfc_llcp_send_ui_frame()
-> nfc_alloc_send_skb() -> Dereference *nfc_dev
(2) virtual_ncidev_close() -> nci_free_device() -> nfc_free_device()
-> put_device() -> nfc_release() -> Free *nfc_dev
When a reference to llcp_local is acquired, we do not acquire the same
for the nfc_dev. This leads to freeing even when the llcp_local is in
use, and this is the case with the UAF described above too.
Thus, when we acquire a reference to llcp_local, we should acquire a
reference to nfc_dev, and release the references appropriately later.
References for llcp_local is initialized in nfc_llcp_register_device()
(which is called by nfc_register_device()). Thus, we should acquire a
reference to nfc_dev there.
nfc_unregister_device() calls nfc_llcp_unregister_device() which in
turn calls nfc_llcp_local_put(). Thus, the reference to nfc_dev is
appropriately released later.
Reported-and-tested-by: syzbot+bbe84a4010eeea00982d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bbe84a4010eeea00982d
Fixes: c7aa12252f ("NFC: Take a reference on the LLCP local pointer when creating a socket")
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change rxrpc's API such that:
(1) A new function, rxrpc_kernel_lookup_peer(), is provided to look up an
rxrpc_peer record for a remote address and a corresponding function,
rxrpc_kernel_put_peer(), is provided to dispose of it again.
(2) When setting up a call, the rxrpc_peer object used during a call is
now passed in rather than being set up by rxrpc_connect_call(). For
afs, this meenat passing it to rxrpc_kernel_begin_call() rather than
the full address (the service ID then has to be passed in as a
separate parameter).
(3) A new function, rxrpc_kernel_remote_addr(), is added so that afs can
get a pointer to the transport address for display purposed, and
another, rxrpc_kernel_remote_srx(), to gain a pointer to the full
rxrpc address.
(4) The function to retrieve the RTT from a call, rxrpc_kernel_get_srtt(),
is then altered to take a peer. This now returns the RTT or -1 if
there are insufficient samples.
(5) Rename rxrpc_kernel_get_peer() to rxrpc_kernel_call_get_peer().
(6) Provide a new function, rxrpc_kernel_get_peer(), to get a ref on a
peer the caller already has.
This allows the afs filesystem to pin the rxrpc_peer records that it is
using, allowing faster lookups and pointer comparisons rather than
comparing sockaddr_rxrpc contents. It also makes it easier to get hold of
the RTT. The following changes are made to afs:
(1) The addr_list struct's addrs[] elements now hold a peer struct pointer
and a service ID rather than a sockaddr_rxrpc.
(2) When displaying the transport address, rxrpc_kernel_remote_addr() is
used.
(3) The port arg is removed from afs_alloc_addrlist() since it's always
overridden.
(4) afs_merge_fs_addr4() and afs_merge_fs_addr6() do peer lookup and may
now return an error that must be handled.
(5) afs_find_server() now takes a peer pointer to specify the address.
(6) afs_find_server(), afs_compare_fs_alists() and afs_merge_fs_addr[46]{}
now do peer pointer comparison rather than address comparison.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
rxrpc_find_service_conn_rcu() should make the "seq" counter odd on the
second pass, otherwise read_seqbegin_or_lock() never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231117164846.GA10410@redhat.com/
Remove documentation for nonexistent struct members, addressing these
warnings:
./net/tipc/link.c:228: warning: Excess struct member 'media_addr' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'timer' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'refcnt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'proto_msg' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'pmsg' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'backlog_limit' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'exp_msg_count' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'reset_rcv_checkpt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'transmitq' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'snt_nxt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'deferred_queue' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'unacked_window' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'next_out' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'long_msg_seq_no' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'bc_rcvr' description in 'tipc_link'
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now all sockets including TIME_WAIT are linked to bhash2 using
sock_common.skc_bind_node.
We no longer use inet_bind2_bucket.deathrow, sock.sk_bind2_node,
and inet_timewait_sock.tw_bind2_node.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we can use sk_bind_node/tw_bind_node for bhash2, which means
we need not link TIME_WAIT sockets separately.
The dead code and sk_bind2_node will be removed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we do not use tb->owners and can unlink sockets from bhash.
sk_bind_node/tw_bind_node are available for bhash2 and will be
used in the following patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use hlist_empty(&tb->owners) to check if the bhash bucket has a socket.
We can check the child bhash2 buckets instead.
For this to work, the bhash2 bucket must be freed before the bhash bucket.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockets in bhash are also linked to bhash2, but TIME_WAIT sockets
are linked separately in tb2->deathrow.
Let's replace tb->owners iteration in inet_csk_bind_conflict() with
two iterations over tb2->owners and tb2->deathrow.
This can be done safely under bhash's lock because socket insertion/
deletion in bhash2 happens with bhash's lock held.
Note that twsk_for_each_bound_bhash() will be removed later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch adds code in the !inet_use_bhash2_on_bind(sk)
case in inet_csk_bind_conflict().
To avoid adding nest and make the change cleaner, this patch
rearranges tests in inet_csk_bind_conflict().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bhash2 added a new member sk_bind2_node in struct sock to link
sockets to bhash2 in addition to bhash.
bhash is still needed to search conflicting sockets efficiently
from a port for the wildcard address. However, bhash itself need
not have sockets.
If we link each bhash2 bucket to the corresponding bhash bucket,
we can iterate the same set of the sockets from bhash2 via bhash.
This patch links bhash2 to bhash only, and the actual use will be
in the later patches. Finally, we will remove sk_bind2_node.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Later, we no longer link sockets to bhash. Instead, each bhash2
bucket is linked to the corresponding bhash bucket.
Then, we pass the bhash bucket to bhash2 allocation functions as
tb. However, tb is already used in inet_bind2_bucket_create() and
inet_bind2_bucket_init() as the bhash2 bucket.
To make the following diff clear, let's use tb2 for the bhash2 bucket
there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any()
are called for each bhash2 bucket to check conflicts. Thus, we call
ipv6_addr_any() and ipv6_addr_v4mapped() over and over during bind().
Let's avoid calling them by saving the address type in inet_bind2_bucket.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bhash2, IPv4/IPv6 addresses are saved in two union members,
which complicate address checks in inet_bind2_bucket_addr_match()
and inet_bind2_bucket_match_addr_any() considering uninitialised
memory and v4-mapped-v6 conflicts.
Let's simplify that by saving IPv4 address as v4-mapped-v6 address
and defining tb2.rcv_saddr as tb2.v6_rcv_saddr.s6_addr32[3].
Then, we can compare v6 address as is, and after checking v4-mapped-v6,
we can compare v4 address easily. Also, we can remove tb2->family.
Note these functions will be further refactored in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The protocol family tests in inet_bind2_bucket_addr_match() and
inet_bind2_bucket_match_addr_any() are ordered as follows.
if (sk->sk_family != tb2->family)
else if (sk->sk_family == AF_INET6)
else
This patch rearranges them so that AF_INET6 socket is handled first
to make the following patch tidy, where tb2->family will be removed.
if (sk->sk_family == AF_INET6)
else if (tb2->family == AF_INET6)
else
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While checking port availability in bind() or listen(), we used only
bhash for all v4-mapped-v6 addresses. But there is no good reason not
to use bhash2 for v4-mapped-v6 non-wildcard addresses.
Let's do it by returning true in inet_use_bhash2_on_bind(). Then, we
also need to add a test in inet_bind2_bucket_match_addr_any() so that
::ffff:X.X.X.X will match with 0.0.0.0.
Note that sk->sk_rcv_saddr is initialised for v4-mapped-v6 sk in
__inet6_bind().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In min_key_size_set():
if (val > hdev->le_max_key_size || val < SMP_MIN_ENC_KEY_SIZE)
return -EINVAL;
hci_dev_lock(hdev);
hdev->le_min_key_size = val;
hci_dev_unlock(hdev);
In max_key_size_set():
if (val > SMP_MAX_ENC_KEY_SIZE || val < hdev->le_min_key_size)
return -EINVAL;
hci_dev_lock(hdev);
hdev->le_max_key_size = val;
hci_dev_unlock(hdev);
The atomicity violation occurs due to concurrent execution of set_min and
set_max funcs.Consider a scenario where setmin writes a new, valid 'min'
value, and concurrently, setmax writes a value that is greater than the
old 'min' but smaller than the new 'min'. In this case, setmax might check
against the old 'min' value (before acquiring the lock) but write its
value after the 'min' has been updated by setmin. This leads to a
situation where the 'max' value ends up being smaller than the 'min'
value, which is an inconsistency.
This possible bug is found by an experimental static analysis tool
developed by our team, BassCheck[1]. This tool analyzes the locking APIs
to extract function pairs that can be concurrently executed, and then
analyzes the instructions in the paired functions to identify possible
concurrency bugs including data races and atomicity violations. The above
possible bug is reported when our tool analyzes the source code of
Linux 5.17.
To resolve this issue, it is suggested to encompass the validity checks
within the locked sections in both set_min and set_max funcs. The
modification ensures that the validation of 'val' against the
current min/max values is atomic, thus maintaining the integrity of the
settings. With this patch applied, our tool no longer reports the bug,
with the kernel configuration allyesconfig for x86_64. Due to the lack of
associated hardware, we cannot test the patch in runtime testing, and just
verify it according to the code logic.
[1] https://sites.google.com/view/basscheck/
Fixes: 18f81241b7 ("Bluetooth: Move {min,max}_key_size debugfs ...")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In case of an incomplete command or a command with a null identifier 2
reject packets will be sent, one with the identifier and one with 0.
Consuming the data of the command will prevent it.
This allows to send a reject packet for each corrupted command in a
multi-command packet.
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
when Bluetooth set the event mask and enter suspend, the controller
has hci mode change event coming, it cause controller can not enter
sleep mode. so it should to set the hci mode change event mask before
enter suspend.
Signed-off-by: clancy shang <clancy.shang@quectel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
For some controllers such as QCA2066, it does not need to send
HCI_Configure_Data_Path to configure non-HCI data transport path to support
HFP offload, their device drivers may set hdev->get_codec_config_data as
NULL, so Explicitly add this non NULL checking before calling the function.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When a PA sync socket is closed, the associated hcon is also unlinked
and cleaned up. If there are no other hcons marked with the
HCI_CONN_PA_SYNC flag, HCI_OP_LE_PA_TERM_SYNC is sent to controller.
Between the time of the command and the moment PA sync is terminated
in controller, residual BIGInfo reports might continue to come.
This causes a new PA sync hcon to be added, and a new socket to be
notified to user space.
This commit fixs this by adding a flag on a Broadcast listening
socket to mark when the PA sync child has been closed.
This flag is checked when BIGInfo reports are indicated in
iso_connect_ind, to avoid recreating a hcon and socket if
residual reports arrive before PA sync is terminated.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This reverts 19f8def031
"Bluetooth: Fix auth_complete_evt for legacy units" which seems to be
working around a bug on a broken controller rather then any limitation
imposed by the Bluetooth spec, in fact if there ws not possible to
re-auth the command shall fail not succeed.
Fixes: 19f8def031 ("Bluetooth: Fix auth_complete_evt for legacy units")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This removes le_restart_scan work and instead just disables controller
duplicate filtering when discovery result_filtering is enabled and
HCI_QUIRK_STRICT_DUPLICATE_FILTER is set.
Link: https://github.com/bluez/bluez/issues/573
Link: https://github.com/bluez/bluez/issues/572
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Most functions in `net/bluetooth/lib.c` lack propper
documentation.
This patch adds documentation to all exported functions
in `net/bluetooth/lib.c`.
Unnecessary or redundant comments are also removed to
ensure the file looks clean.
Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
For ISO Broadcast, all BISes from a BIG have the same lifespan - they
cannot be created or terminated independently from each other.
This links together all BIS hcons that are part of the same BIG, so all
hcons are kept alive as long as the BIG is active.
If multiple BIS sockets are opened for a BIG handle, and only part of
them are closed at some point, the associated hcons will be marked as
open. If new sockets will later be opened for the same BIG, they will
be reassociated with the open BIS hcons.
All BIS hcons will be cleaned up and the BIG will be terminated when
the last BIS socket is closed from userspace.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This makes it possible to bind a PA sync socket to a number of BISes
before issuing the BIG Create Sync command.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
A tracepoint fix that's been reading past the end of messages forever,
but semi-recently also went over the end of the buffer.
And a potential incorrectly freeing garbage in pdu parsing error path
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmWE4HEACgkQq06b7GqY
5nBiaw//VKYbPnx+4wtyT17BnvamwEQH88kdIFLjcmGfCKJyBL1UrBfNoV+azzUa
x4ob+M82zkRgfsQ0bDW1A4OWmRWXWngr/VJ53HpQA0FYff6dFGULXzHwQnz2Sn16
Lhn7D0FhZZ12skIcCCDRADZIT9ydh7fEfaI3F2goDuvwopwAoOZnYQjIwVhjNuVo
EIHO3VJShHh/sDMxdyurCkiv+PszIqUgeVW3FH9KNlFC/8hdKpMPGzVHpd1S72wP
9rex+9yIWb5+o6zpxIB4K6lRWJg0itPKMRDGbey3BR1U7YAQlsGXJ54I6lMRSJ8N
penT0ztpBSEVWauPvWt0cmkb/ccBFWml4uncT9WK47ExwtbmOhqsk/wOWon8Jho/
i09lB/Uk5jHSvjong6BbKZ8rYhHdOK9lXfr8UWoJ+jWAilMHqt6Yxb61MRizfe60
nsVNExBNr78gP/k2U0Fd1++27nB48jvvXN07lVM6uWqyjzaLXahVAb6moz9O7/PD
ntZVocPCL+XLoByOhRrLNgpyrWiR3ClurpQCsR8d4AE3H7D0CyKLbdLBUBH6fkul
RErrolev90AKNYthOf6MFWpZuyWnTpK4RooaB0D6KNsIfMUOhfAaIqaM2u50RK/T
B7hWbFCSPfXkpAM9XcYayVnaJp2dj9MBu6ioQIns4DqqHB//M8U=
=jdEx
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux
Pull 9p fixes from Dominique Martinet:
"Two small fixes scheduled for stable trees:
A tracepoint fix that's been reading past the end of messages forever,
but semi-recently also went over the end of the buffer. And a
potential incorrectly freeing garbage in pdu parsing error path"
* tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux:
net: 9p: avoid freeing uninit memory in p9pdu_vreadf
9p: prevent read overrun in protocol dump tracepoint
Parse netlink attribute containing the chain type in this update, to
bail out if this is different from the existing type.
Otherwise, it is possible to define a chain with the same name, hook and
priority but different type, which is silently ignored.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
conntrack zones are heavily used by tools like openvswitch to run
multiple virtual "routers" on a single machine. In this context each
conntrack zone matches to a single router, thereby preventing
overlapping IPs from becoming issues.
In these systems it is common to operate on all conntrack entries of a
given zone, e.g. to delete them when a router is deleted. Previously this
required these tools to dump the full conntrack table and filter out the
relevant entries in userspace potentially causing performance issues.
To do this we reuse the existing CTA_ZONE attribute. This was previous
parsed but not used during dump and flush requests. Now if CTA_ZONE is
set we filter these operations based on the provided zone.
However this means that users that previously passed CTA_ZONE will
experience a difference in functionality.
Alternatively CTA_FILTER could have been used for the same
functionality. However it is not yet supported during flush requests and
is only available when using AF_INET or AF_INET6.
Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Co-developed-by: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If a transaction is aborted, we should mark the to-be-released NEWSET dead,
just like commit path does for DEL and DESTROYSET commands.
In both cases all remaining elements will be released via
set->ops->destroy().
The existing abort code does NOT post the actual release to the work queue.
Also the entire __nf_tables_abort() function is wrapped in gc_seq
begin/end pair.
Therefore, async gc worker will never try to release the pending set
elements, as gc sequence is always stale.
It might be possible to speed up transaction aborts via work queue too,
this would result in a race and a possible use-after-free.
So fix this before it becomes an issue.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to use GFP_ATOMIC here.
Fixes: f6c383b8c3 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Set expressions' dump callbacks are not concurrency-safe per-se with
reset bit set. If two CPUs reset the same element at the same time,
values may underrun at least with element-attached counters and quotas.
Prevent this by introducing dedicated callbacks for nfnetlink and the
asynchronous dump handling to serialize access.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a wrapper around nft_ctx_init() for use in
nf_tables_getsetelem() and a resetting equivalent introduced later.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The function is not supposed to alter the set, passing the pointer as
const is fine and merely requires to adjust signatures of two called
functions as well.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
During ieee80211_set_active_links() we do (among the others):
1. Call drv_change_vif_links() with both old_active and new_active
2. Unassign the chanctx for the removed link(s) (if any)
3. Assign chanctx to the added link(s) (if any)
4. Call drv_change_vif_links() with the new_active links bitmap
The problem here is that during step #1 the driver doesn't know whether
we will activate multiple links simultaneously or are just doing a link
switch, so it can't check there if multiple links are supported/enabled.
(Some of the drivers might enable/disable this option dynamically)
And during step #3, in which the driver already knows that,
returning an error code (for example when multiple links are not
supported or disabled), will cause a warning, and we will still complete
the transition to the new_active links.
(It is hard to undo things in that stage, since we released channels etc.)
Therefore add a driver callback to check if the desired new_active links
will be supported by the driver or not. This callback will be called
in the beginning of ieee80211_set_active_links() so we won't do anything
before we are sure it is supported.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20231220133549.64c4d70b33b8.I79708619be76b8ecd4ef3975205b8f903e24a2cd@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Handle a case of time overflow, where the switch time might
be smaller than the partial TSF in the beacon.
Additionally, apply advertised TTLM earlier in order to be
ready on time on the newly activated links.
Fixes: 702e80470a ("wifi: mac80211: support handling of advertised TID-to-link mapping")
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.15079c34e5c8.I0dd50bcceff5953080cdd7aee5118b72c78c6507@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211_update_known_bss will always consume the passed IEs. As such,
cfg80211_update_assoc_bss_entry also needs to always set the pointers to
NULL so that no double free can occur.
Note that hitting this would probably require being connected to a
hidden BSS which is then doing a channel switch while also switching to
be not hidden anymore at the same time.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.8891edb28d51.Id09c5145363e990ff5237decd58296302e2d53c8@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211_bss_update is expected to consume the IEs that are passed into
it in the temporary internal BSS. This did not happen in some error
cases (which are also WARN_ON paths), so change the code to use a common
label and use that everywhere.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.8e72ea105e17.Ic81e9431e980419360e97502ce8c75c58793f05a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is a more of a cosmetic fix. The branch will only be taken if
proberesp_ies is set, which implies that beacon_ies is not set unless we
are connected to an AP that just did a channel switch. And, in that case
we should have found the BSS in the internal storage to begin with.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.b898e22dadff.Id8c4c10aedd176ef2e18a4cad747b299f150f9df@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When reporting the radiotap timestamp, the mactime field is
usually unused, we take the data from the device_timestamp.
However, there can be cases where the radiotap timestamp is
better reported as a 64-bit value, so since the mactime is
free, add a flag to support using the mactime as a 64-bit
radiotap timestamp.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.00c8b9234f0c.Ie3ce5eae33cce88fa01178e7aea94661ded1ac24@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We only have a single flag free, and before using that for
another mactime flag, instead refactor the mactime flags
to use a 2-bit field.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.d0e664832d14.I20c8900106f9bf81316bed778b1e3ce145785274@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
UHB AP send supported power type(LPI, SP, VLP)
in beacon and probe response IE and STA should
connect to these AP only if their regulatory support
the AP power type.
Beacon/Probe response are reported to userspace
with reason "STA regulatory not supporting to connect to AP
based on transmitted power type" and it should
not connect to AP.
Signed-off-by: Mukesh Sisodiya <mukesh.sisodiya@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.cbfbef9170a9.I432f78438de18aa9f5c9006be12e41dc34cc47c5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some devices may support concurrent DFS operation which relies on the
BSS channel width for its relaxations. Notify cfg80211 about BW change
so it can schedule regulatory checks.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.e08f8e9ebc67.If8915d13e203ebd380579f55fd9148e9b3f43306@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Due to different relaxation policies it may be needed to re-check
channels after a BSS station interface is disconnected or performed a
channel switch.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.1f2f8475bcf1.I1879d259d8d756159c8060f61f4bce172e6d323e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
FCC-594280 D01 Section B.3 allows peer-to-peer and ad hoc devices to
operate on DFS channels while they operate under the control of a
concurrent DFS master. For example, it is possible to have a P2P GO on a
DFS channel as long as BSS connection is active on the same channel.
Allow such operation by adding additional regulatory flags to indicate
DFS concurrent channels and capable devices. Add the required
relaxations in DFS regulatory checks.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220133549.bdfb8a9c7c54.I973563562969a27fea8ec5685b96a3a47afe142f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It is possible for the TX status report for the (Re)Association Request
frame to be delayed long enough for the AP's (Re)Association Response
frame to be received and processed before it. If that were to happen for
a case where the AP rejects the association with indication to come back
later, the association timeout and retry state should not be modified
anymore with the TX status information that would be processed after
this. Updating the association timeout in such a reverse order of events
could result in shortening the timeouts for the association comeback
mechanism and that could result in the association failing.
Track whether we have already processed association rejection with
comeback time and if so, skip the timeout and retry update on any
following TX status report.
Signed-off-by: Jouni Malinen <quic_jouni@quicinc.com>
Link: https://msgid.link/20231219174814.2581575-1-j@w1.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmWEVXEACgkQ+7dXa6fL
C2sxbw/+IIgTpjfXDGQoGOpcvQyHW+gMGFqrrZjKJZGQiNZ0DfHPciYcMvOOyqZp
rbt22V/WvQKOlcQ1IYQqjdB47DilFGRepRLZ/fuqq6JDmcHGx2Btj8uJTsV0He4o
rCLXVrfm/JNYECY6dO5bGizrCYL6clVo0x/U2LPlU/2mbXltY1d1yXtzE++6kBZl
w/MLJDmQxvONarhpdD0J9E/uAJ+kHX05HhlqnSxu8HEoGHMVka1N5EGAOq9cICvm
y/8NwnGtflhpJEIso2Kx7XAE8kszXyKw0PJvOaO4GG1PWMs3rIrZbHn7wCbChyMi
xOw+qZVC60BTang/vEOo5I4eFD+NIdBDoGdyuyNICXDIMQ9WvN2nF5qUdFAeR7Vi
Dgxld1WWHm6RcOjl6y9t5Na0zJmgdOyONWx6Xli/AJw2RTx5JiVzDuKP6yu+DMvn
DUPrjEQ1m+qPbTwclEzqu3grNabp7EX1vYRKDC4bf+Lg8iGNxlFp+2uyg14HsDUH
N/yqnj8MK6ADcVMfZGGUalIzsgN06vHfHhE7Tj4xSnrR1dekxBveNFJM3r+eeaLV
0VsjHW/IMKPWxO/vDzi6zr0nBeWYQgxAAg+w3LXl3qRGEXlihibmosofovhQFD6k
GhkXojmc3BeSceVfOcEHZu0xXZIy/2y+hZy95BbNLT/eiCwIf/M=
=GJNW
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20231221' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Improve the interaction of arbitrary lookups in the AFS dynamic root
that hit DNS lookup failures [1] where kafs behaves differently from
openafs and causes some applications to fail that aren't expecting
that. Further, negative DNS results aren't getting removed and are
causing failures to persist.
- Always delete unused (particularly negative) dentries as soon as
possible so that they don't prevent future lookups from retrying.
- Fix the handling of new-style negative DNS lookups in ->lookup() to
make them return ENOENT so that userspace doesn't get confused when
stat succeeds but the following open on the looked up file then
fails.
- Fix key handling so that DNS lookup results are reclaimed almost as
soon as they expire rather than sitting round either forever or for
an additional 5 mins beyond a set expiry time returning
EKEYEXPIRED. They persist for 1s as /bin/ls will do a second stat
call if the first fails"
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637 [1]
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
* tag 'afs-fixes-20231221' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
keys, dns: Allow key types (eg. DNS) to be reclaimed immediately on expiry
afs: Fix dynamic root lookup DNS check
afs: Fix the dynamic root's d_delete to always delete unused dentries
Current release - regressions:
- bpf: syzkaller found null ptr deref in unix_bpf proto add
- eth: i40e: fix ST code value for clause 45
Previous releases - regressions:
- core: return error from sk_stream_wait_connect() if sk_wait_event() fails
- ipv6: revert remove expired routes with a separated list of routes
- wifi rfkill:
- set GPIO direction
- fix crash with WED rx support enabled
- bluetooth:
- fix deadlock in vhci_send_frame
- fix use-after-free in bt_sock_recvmsg
- eth: mlx5e: fix a race in command alloc flow
- eth: ice: fix PF with enabled XDP going no-carrier after reset
- eth: bnxt_en: do not map packet buffers twice
Previous releases - always broken:
- core:
- check vlan filter feature in vlan_vids_add_by_dev() and vlan_vids_del_by_dev()
- check dev->gso_max_size in gso_features_check()
- mptcp: fix inconsistent state on fastopen race
- phy: skip LED triggers on PHYs on SFP modules
- eth: mlx5e:
- fix double free of encap_header
- fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWELNYSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkL0MQALYng9iIz5m/iX0pjRpI4HxruflMkdWa
+FMRnWp0OE5ak8l/UcffgYRNozXAcA/PSJhanZU3gT21IeJ+X78paivWyEPUhpqN
d1nhRRsnx37fTK3lbS/wSGJUN+x4g49kHQn5mw8vfi/RGHuc/vbLO23iWsawB92/
7YI0rEzZh2b1FKytvqF9t2lLtJw5ucwQtdm3d/tg4iuL44Lq8dA69dln4wZx3t28
lobsW0eQW2JRh2YwrREb1oUD0CcUNk+XGsWVyUXqs31OflkqYMEzI41Yxs/lHJs0
0Lmt3/F2Ls+H+vEYElJ0wsNPFZr4TDhAsV5KMxZdoBfWhTN8ordloBXGHre1IVSK
SQtja5IqT01dLbDoL7tLpyEsGLp1A+HPH+BVxt582srSMoXWmFYOZcRczKJ85C1W
qaohCGeEO537ExrAMHbJ0CxR3oSawyOBszjTYGdbI3xiFj5q1n48YyJSep//rgvP
PewzqtMpPPapPIiJbvRjN8Mn56Y2832TSbPOVZ2KJuBpx+i/mjXyIK97FMb+Jdvu
3ACH3BmsUfvXXpXNSZIgtc35tP03GSeV9B2vzlhjFwxB2vV4wuX9NbI5OIWi7ZM1
03jkC2wQf6jVby45IM5kMuEKL3hMXUx9t0nZg0szJ3T31+OQ6e5Hlv1Aqp4Ihn5Q
N+fxo6lpm+Aq
=sEmi
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi and bpf.
Current release - regressions:
- bpf: syzkaller found null ptr deref in unix_bpf proto add
- eth: i40e: fix ST code value for clause 45
Previous releases - regressions:
- core: return error from sk_stream_wait_connect() if sk_wait_event()
fails
- ipv6: revert remove expired routes with a separated list of routes
- wifi rfkill:
- set GPIO direction
- fix crash with WED rx support enabled
- bluetooth:
- fix deadlock in vhci_send_frame
- fix use-after-free in bt_sock_recvmsg
- eth: mlx5e: fix a race in command alloc flow
- eth: ice: fix PF with enabled XDP going no-carrier after reset
- eth: bnxt_en: do not map packet buffers twice
Previous releases - always broken:
- core:
- check vlan filter feature in vlan_vids_add_by_dev() and
vlan_vids_del_by_dev()
- check dev->gso_max_size in gso_features_check()
- mptcp: fix inconsistent state on fastopen race
- phy: skip LED triggers on PHYs on SFP modules
- eth: mlx5e:
- fix double free of encap_header
- fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list()"
* tag 'net-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
net: check dev->gso_max_size in gso_features_check()
kselftest: rtnetlink.sh: use grep_fail when expecting the cmd fail
net/ipv6: Revert remove expired routes with a separated list of routes
net: avoid build bug in skb extension length calculation
net: ethernet: mtk_wed: fix possible NULL pointer dereference in mtk_wed_wo_queue_tx_clean()
net: stmmac: fix incorrect flag check in timestamp interrupt
selftests: add vlan hw filter tests
net: check vlan filter feature in vlan_vids_add_by_dev() and vlan_vids_del_by_dev()
net: hns3: add new maintainer for the HNS3 ethernet driver
net: mana: select PAGE_POOL
net: ks8851: Fix TX stall caused by TX buffer overrun
ice: Fix PF with enabled XDP going no-carrier after reset
ice: alter feature support check for SRIOV and LAG
ice: stop trashing VF VSI aggregator node ID information
mailmap: add entries for Geliang Tang
mptcp: fill in missing MODULE_DESCRIPTION()
mptcp: fix inconsistent state on fastopen race
selftests: mptcp: join: fix subflow_send_ack lookup
net: phy: skip LED triggers on PHYs on SFP modules
bpf: Add missing BPF_LINK_TYPE invocations
...
If a key has an expiration time, then when that time passes, the key is
left around for a certain amount of time before being collected (5 mins by
default) so that EKEYEXPIRED can be returned instead of ENOKEY. This is a
problem for DNS keys because we want to redo the DNS lookup immediately at
that point.
Fix this by allowing key types to be marked such that keys of that type
don't have this extra period, but are reclaimed as soon as they expire and
turn this on for dns_resolver-type keys. To make this easier to handle,
key->expiry is changed to be permanent if TIME64_MAX rather than 0.
Furthermore, give such new-style negative DNS results a 1s default expiry
if no other expiry time is set rather than allowing it to stick around
indefinitely. This shouldn't be zero as ls will follow a failing stat call
immediately with a second with AT_SYMLINK_NOFOLLOW added.
Fixes: 1a4240f476 ("DNS: Separate out CIFS DNS Resolver code")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Wang Lei <wang840925@gmail.com>
cc: Jeff Layton <jlayton@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jarkko Sakkinen <jarkko@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: keyrings@vger.kernel.org
cc: netdev@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYQVjgAKCRDbK58LschI
g/NfAP9xMBCASd22+KPu44FtPPO5DKcdG7hATXZMpb/cygF8GQEAojcZ4jztx42S
F1+4RPEoxrn31oVYdtEGFY9q85ruzgA=
=2XhN
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-12-21
Hi David, hi Jakub, hi Paolo, hi Eric,
The following pull-request contains BPF updates for your *net* tree.
We've added 3 non-merge commits during the last 5 day(s) which contain
a total of 4 files changed, 45 insertions(+).
The main changes are:
1) Fix a syzkaller splat which triggered an oob issue in bpf_link_show_fdinfo(),
from Jiri Olsa.
2) Fix another syzkaller-found issue which triggered a NULL pointer dereference
in BPF sockmap for unconnected unix sockets, from John Fastabend.
bpf-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add missing BPF_LINK_TYPE invocations
bpf: sockmap, test for unconnected af_unix sock
bpf: syzkaller found null ptr deref in unix_bpf proto add
====================
Link: https://lore.kernel.org/r/20231221104844.1374-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some drivers might misbehave if TSO packets get too big.
GVE for instance uses a 16bit field in its TX descriptor,
and will do bad things if a packet is bigger than 2^16 bytes.
Linux TCP stack honors dev->gso_max_size, but there are
other ways for too big packets to reach an ndo_start_xmit()
handler : virtio_net, af_packet, GRO...
Add a generic check in gso_features_check() and fallback
to GSO when needed.
gso_max_size was added in the blamed commit.
Fixes: 82cc1a7a56 ("[NET]: Add per-connection option to set max TSO frame size")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231219125331.4127498-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit 3dec89b14d.
The commit has some race conditions given how expires is managed on a
fib6_info in relation to gc start, adding the entry to the gc list and
setting the timer value leading to UAF. Revert the commit and try again
in a later release.
Fixes: 3dec89b14d ("net/ipv6: Remove expired routes with a separated list of routes")
Cc: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231219030243.25687-1-dsahern@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
GCC seems to incorrectly fail to evaluate skb_ext_total_length() at
compile time under certain conditions.
The issue even occurs if all values in skb_ext_type_len[] are "0",
ruling out the possibility of an actual overflow.
As the patch has been in mainline since v6.6 without triggering the
problem it seems to be a very uncommon occurrence.
As the issue only occurs when -fno-tree-loop-im is specified as part of
CFLAGS_GCOV, disable the BUILD_BUG_ON() only when building with coverage
reporting enabled.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312171924.4FozI5FG-lkp@intel.com/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/487cfd35-fe68-416f-9bfd-6bb417f98304@app.fastmail.com/
Fixes: 5d21d0a65b ("net: generalize calculation of skb extensions length")
Cc: <stable@vger.kernel.org>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231218-net-skbuff-build-bug-v1-1-eefc2fb0a7d3@weissschuh.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
NFT_MSG_DELSET deactivates all elements in the set, skip
set->ops->commit() to avoid the unnecessary clone (for the pipapo case)
as well as the sync GC cycle, which could deactivate again expired
elements in such set.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Kevin Rich <kevinrich1337@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Continue expanding Daniel's patch by adding new skb drop reasons that
are idiosyncratic to TC.
More specifically:
- SKB_DROP_REASON_TC_COOKIE_ERROR: An error occurred whilst
processing a tc ext cookie.
- SKB_DROP_REASON_TC_CHAIN_NOTFOUND: tc chain lookup failed.
- SKB_DROP_REASON_TC_RECLASSIFY_LOOP: tc exceeded max reclassify loop
iterations
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Incrementing on Daniel's patch[1], make tc-related drop reason more
flexible for remaining qdiscs - that is, all qdiscs aside from clsact.
In essence, the drop reason will be set by cls_api and act_api in case
any error occurred in the data path. With that, we can give the user more
detailed information so that they can distinguish between a policy drop
or an error drop.
[1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move drop_reason from struct tcf_result to skb cb - more specifically to
struct tc_skb_cb. With that, we'll be able to also set the drop reason for
the remaining qdiscs (aside from clsact) that do not have access to
tcf_result when time comes to set the skb drop reason.
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that both the common code as well as individual drivers support MDB
bulk deletion, allow user space to make such requests.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement MDB bulk deletion support in the bridge driver, allowing MDB
entries to be deleted in bulk according to provided parameters.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Invoke the new MDB bulk deletion device operation when the 'NLM_F_BULK'
flag is set in the netlink message header.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For MDB bulk delete we will need to validate 'MDBA_SET_ENTRY'
differently compared to regular delete. Specifically, allow the ifindex
to be zero (in case not filtering on bridge port) and force the address
to be zero as bulk delete based on address is not supported.
Do that by introducing a new policy and choosing the correct policy
based on the presence of the 'NLM_F_BULK' flag in the netlink message
header. Use nlmsg_parse() for strict validation.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add encryption key size check when acting as peripheral
- Shut up false-positive build warning
- Send reject if L2CAP command request is corrupted
- Fix Use-After-Free in bt_sock_recvmsg
- Fix not notifying when connection encryption changes
- Fix not checking if HCI_OP_INQUIRY has been sent
- Fix address type send over to the MGMT interface
- Fix deadlock in vhci_send_frame
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmV8mvcZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKcFTEACQRFdK50+kKekfzNoFe4SA
CEDQqPyoR59Q12zc/sZmVVdWKCLjHT3yx9/3UPOPhz4LWQl0juvpPhNE/WSIh+Dx
eozOZViebOfh9DstOSnjDKg7qGrybEjmtpVJn+jcFfzJedDVHF+POM97SleGSj50
EgJ8j3Qe8KNVQqhT0beJUE/Ku5f6zDBxZ3gcgphIU2CVnWQwrR36yaY/psDRn3oz
YXcSz/Un4gWpIw0pwby8kViFjcnt8c65xJLX0o+WL+2rJecTiikDvSPtVHhI3F9W
1dmZ9rNg+mBYc+Lw/LkpDpj051+A+dINa71jLS/XCZyj4MMQw3uH/8D2diHwLxX8
Nst3ZF/xDZsEjFEbJEfDamM6M7DtqmltQ8A28Lmp90+AMIczNlrU2ZYAnRd+zK/H
pbzTNddbVEj7gSSJb2OrymN45i1unX7dHZ9ASDAzBDxgQEdsrM0W1Qp0YW+7qefa
ihvQOGM52ZdGNM6C73rvSYXssGKTp8G/illIQkHg4szs1H96Yw4x2ArXo5S1KANC
5c/k4qQEsr7vgDgkfORtJZ6BKFBb6p6/toyf2GizAiGfcsKCUwHvd7i8/AeGznj3
bCkLnL5v9nOSEeIqvBe9hIFsL6a3k1CbpTF4U/tK33Ybqqp8eiGrQY35+1EESdY4
Cy5fz5AgONuLAiOoypmN5A==
=JAze
-----END PGP SIGNATURE-----
Merge tag 'for-net-2023-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Add encryption key size check when acting as peripheral
- Shut up false-positive build warning
- Send reject if L2CAP command request is corrupted
- Fix Use-After-Free in bt_sock_recvmsg
- Fix not notifying when connection encryption changes
- Fix not checking if HCI_OP_INQUIRY has been sent
- Fix address type send over to the MGMT interface
- Fix deadlock in vhci_send_frame
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch, transport offset (pkt->thoff) provides an offset
relative to the network header. This is fine for the inet families
because skb->data points to the network header in such case. However,
from netdev/egress, skb->data points to the mac header (if available),
thus, pkt->thoff is missing the mac header length.
Add skb_network_offset() to the transport offset (pkt->thoff) for
netdev, so transport header mangling works as expected. Adjust payload
fast eval function to use skb->data now that pkt->thoff provides an
absolute offset. This explains why users report that matching on
egress/netdev works but payload mangling does not.
This patch implicitly fixes payload mangling for IPv4 packets in
netdev/egress given skb_store_bits() requires an offset from skb->data
to reach the transport header.
I suspect that nft_exthdr and the trace infra were also broken from
netdev/egress because they also take skb->data as start, and pkt->thoff
was not correct.
Note that IPv6 is fine because ipv6_find_hdr() already provides a
transport offset starting from skb->data, which includes
skb_network_offset().
The bridge family also uses nft_set_pktinfo_ipv4_validate(), but there
skb_network_offset() is zero, so the update in this patch does not alter
the existing behaviour.
Fixes: 42df6e1d22 ("netfilter: Introduce egress hook")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to
using io_is_uring_fops() to detect whether this is a io_uring fd or not.
With that done, kill off io_uring_get_socket() as nobody calls it
anymore.
This is in preparation to yanking out the rest of the core related to
unix gc with io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYHIBwAKCRDbK58LschI
g7nvAQDM5/5vMbsO9rEJAdHomJ4Ybvcv/2SG652za9DVEu6VUgD+JCLJ9He/hMx2
1wZ3wcjwxibNAZ/cZp6yu1JpOtR3rQo=
=BO/b
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-12-19
Hi David, hi Jakub, hi Paolo, hi Eric,
The following pull-request contains BPF updates for your *net-next* tree.
We've added 2 non-merge commits during the last 1 day(s) which contain
a total of 40 files changed, 642 insertions(+), 2926 deletions(-).
The main changes are:
1) Revert all of BPF token-related patches for now as per list discussion [0],
from Andrii Nakryiko.
[0] https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
2) Fix a syzbot-reported use-after-free read in nla_find() triggered from
bpf_skb_get_nlattr_nest() helper, from Jakub Kicinski.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
Revert BPF token-related functionality
bpf: Use nla_ok() instead of checking nla_len directly
====================
Link: https://lore.kernel.org/r/20231219170359.11035-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Expose the previously introduced notification multicast messages
filtering infrastructure and allow the user to select messages using
port index.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently the user listening on a socket for devlink notifications
gets always all messages for all existing instances, even if he is
interested only in one of those. That may cause unnecessary overhead
on setups with thousands of instances present.
User is currently able to narrow down the devlink objects replies
to dump commands by specifying select attributes.
Allow similar approach for notifications. Introduce a new devlink
NOTIFY_FILTER_SET which the user passes the select attributes. Store
these per-socket and use them for filtering messages
during multicast send.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Make the code using filter function a bit nicer by consolidating the
filter function arguments using typedef.
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce an xarray for Generic netlink family to store per-socket
private. Initialize this xarray only if family uses per-socket privs.
Introduce genl_sk_priv_get() to get the socket priv pointer for a family
and initialize it in case it does not exist.
Introduce __genl_sk_priv_get() to obtain socket priv pointer for a
family under RCU read lock.
Allow family to specify the priv size, init() and destroy() callbacks.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce a helper devlink_nl_notify_send() so each object notification
function does not have to call genlmsg_multicast_netns() with the same
arguments.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce devlink_nl_notify_need() helper and using it to check at the
beginning of notification functions to avoid overhead of composing
notification messages in case nobody listens.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce __devl_is_registered() which does not assert on devlink
instance lock and use it in notifications which may be called
without devlink instance lock held.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of checking the xarray mark directly using xa_get_mark() helper
use devl_is_registered() helper which wraps it up. Note that there are
couple more users of xa_get_mark() left which are going to be handled
by the next patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
I got the below warning trace:
WARNING: CPU: 4 PID: 4056 at net/core/dev.c:11066 unregister_netdevice_many_notify
CPU: 4 PID: 4056 Comm: ip Not tainted 6.7.0-rc4+ #15
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:unregister_netdevice_many_notify+0x9a4/0x9b0
Call Trace:
rtnl_dellink
rtnetlink_rcv_msg
netlink_rcv_skb
netlink_unicast
netlink_sendmsg
__sock_sendmsg
____sys_sendmsg
___sys_sendmsg
__sys_sendmsg
do_syscall_64
entry_SYSCALL_64_after_hwframe
It can be repoduced via:
ip netns add ns1
ip netns exec ns1 ip link add bond0 type bond mode 0
ip netns exec ns1 ip link add bond_slave_1 type veth peer veth2
ip netns exec ns1 ip link set bond_slave_1 master bond0
[1] ip netns exec ns1 ethtool -K bond0 rx-vlan-filter off
[2] ip netns exec ns1 ip link add link bond_slave_1 name bond_slave_1.0 type vlan id 0
[3] ip netns exec ns1 ip link add link bond0 name bond0.0 type vlan id 0
[4] ip netns exec ns1 ip link set bond_slave_1 nomaster
[5] ip netns exec ns1 ip link del veth2
ip netns del ns1
This is all caused by command [1] turning off the rx-vlan-filter function
of bond0. The reason is the same as commit 01f4fd2708 ("bonding: Fix
incorrect deletion of ETH_P_8021AD protocol vid from slaves"). Commands
[2] [3] add the same vid to slave and master respectively, causing
command [4] to empty slave->vlan_info. The following command [5] triggers
this problem.
To fix this problem, we should add VLAN_FILTER feature checks in
vlan_vids_add_by_dev() and vlan_vids_del_by_dev() to prevent incorrect
addition or deletion of vlan_vid information.
Fixes: 348a1443cc ("vlan: introduce functions to do mass addition/deletion of vids by another device")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When an interface is removed, we should also be deleting the driver
debugfs entries (as it might still exist in DOWN state in mac80211). At
the same time, when adding an interface, we can check the
IEEE80211_SDATA_IN_DRIVER flag to know whether the interface was
previously known to the driver and is simply being reconfigured.
Fixes: a1f5dcb1c0 ("wifi: mac80211: add a driver callback to add vif debugfs")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220043149.a9f64c359424.I7076526b5297ae8f832228079c999f7b8e147a4c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The driver debugfs entries still exist when the interface is re-added
during reconfiguration. This can be either because of a HW restart
(in_reconfig) or because we are resuming.
Fixes: a1f5dcb1c0 ("wifi: mac80211: add a driver callback to add vif debugfs")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231220043149.ddd48c66ec6b.Ia81080d92129ceecf462eceb4966bab80df12060@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
=stU2
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-12-18
This PR is larger than usual and contains changes in various parts
of the kernel.
The main changes are:
1) Fix kCFI bugs in BPF, from Peter Zijlstra.
End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.
2) Introduce BPF token object, from Andrii Nakryiko.
It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.
Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.
- Complete precision tracking support for register spills
- Fix verification of possibly-zero-sized stack accesses
- Fix access to uninit stack slots
- Track aligned STACK_ZERO cases as imprecise spilled registers.
It improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs.
- Fix verifier retval logic
4) Support for VLAN tag in XDP hints, from Larysa Zaremba.
5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.
End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.
6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.
7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.
It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.
8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.
9) BPF file verification via fsverity, from Song Liu.
It allows BPF progs get fsverity digest.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
bpf: Ensure precise is reset to false in __mark_reg_const_zero()
selftests/bpf: Add more uprobe multi fail tests
bpf: Fail uprobe multi link with negative offset
selftests/bpf: Test the release of map btf
s390/bpf: Fix indirect trampoline generation
selftests/bpf: Temporarily disable dummy_struct_ops test on s390
x86/cfi,bpf: Fix bpf_exception_cb() signature
bpf: Fix dtor CFI
cfi: Add CFI_NOSEAL()
x86/cfi,bpf: Fix bpf_struct_ops CFI
x86/cfi,bpf: Fix bpf_callback_t CFI
x86/cfi,bpf: Fix BPF JIT call
cfi: Flip headers
selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
bpf: Limit the number of kprobes when attaching program to multiple kprobes
bpf: Limit the number of uprobes when attaching program to multiple uprobes
bpf: xdp: Register generic_kfunc_set with XDP programs
selftests/bpf: utilize string values for delegate_xxx mount options
...
====================
Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The second features pull request for v6.8. A bigger one this time with
changes both to stack and drivers. We have a new Wifi band RFI (WBRF)
mitigation feature for which we pulled an immutable branch shared with
other subsystems. And, as always, other new features and bug fixes all
over.
Major changes:
cfg80211/mac80211
* AMD ACPI based Wifi band RFI (WBRF) mitigation feature
* Basic Service Set (BSS) usage reporting
* TID to link mapping support
* mac80211 hardware flag to disallow puncturing
iwlwifi
* new debugfs file fw_dbg_clear
mt76
* NVMEM EEPROM improvements
* mt7996 Extremely High Throughpu (EHT) improvements
* mt7996 Wireless Ethernet Dispatcher (WED) support
* mt7996 36-bit DMA support
ath12k
* support one MSI vector
* WCN7850: support AP mode
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmWAdRERHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZu0RAf+JtHgfjmUMFb54xcncLgj8ZAN82E0ThE0
bPewQDhot0QTri4s7i5Kn8PCWjk+eKEmiIK+eARM+JDyZMTlCpXs2Y92cDAGQ8KG
+LbIMRQkwOUg0HmtX3NysUG3mGAx4QTcIX/y3+GmtMZpKXMFuNy6ODuFvuWFNJrF
3XTq1qFQNnA0XqUDKHW9uareeCiOMVOsqcxNW2FAi2gqRUfQpKnU1Ukv5iOjkqE9
i53GHzeAG2WI4/YjXaTEZvibkM3jqrPcquHlul3fVuq05qkKOEuiy2UalDjgDCYp
u91vbmMpcOjhlf9GIiu2BF6K/muEUCCIjlh5oxob0k9NiKhnPUZLng==
=6Y8M
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2023-12-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.8
The second features pull request for v6.8. A bigger one this time with
changes both to stack and drivers. We have a new Wifi band RFI (WBRF)
mitigation feature for which we pulled an immutable branch shared with
other subsystems. And, as always, other new features and bug fixes all
over.
Major changes:
cfg80211/mac80211
* AMD ACPI based Wifi band RFI (WBRF) mitigation feature
* Basic Service Set (BSS) usage reporting
* TID to link mapping support
* mac80211 hardware flag to disallow puncturing
iwlwifi
* new debugfs file fw_dbg_clear
mt76
* NVMEM EEPROM improvements
* mt7996 Extremely High Throughpu (EHT) improvements
* mt7996 Wireless Ethernet Dispatcher (WED) support
* mt7996 36-bit DMA support
ath12k
* support one MSI vector
* WCN7850: support AP mode
* tag 'wireless-next-2023-12-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (207 commits)
wifi: mt76: mt7996: Use DECLARE_FLEX_ARRAY() and fix -Warray-bounds warnings
wifi: ath11k: workaround too long expansion sparse warnings
Revert "wifi: ath12k: use ATH12K_PCI_IRQ_DP_OFFSET for DP IRQ"
wifi: rt2x00: remove useless code in rt2x00queue_create_tx_descriptor()
wifi: rtw89: only reset BB/RF for existing WiFi 6 chips while starting up
wifi: rtw89: add DBCC H2C to notify firmware the status
wifi: rtw89: mac: add suffix _ax to MAC functions
wifi: rtw89: mac: add flags to check if CMAC and DMAC are enabled
wifi: rtw89: 8922a: add power on/off functions
wifi: rtw89: add XTAL SI for WiFi 7 chips
wifi: rtw89: phy: print out RFK log with formatted string
wifi: rtw89: parse and print out RFK log from C2H events
wifi: rtw89: add C2H event handlers of RFK log and report
wifi: rtw89: load RFK log format string from firmware file
wifi: rtw89: fw: add version field to BB MCU firmware element
wifi: rtw89: fw: load TX power track tables from fw_element
wifi: mwifiex: configure BSSID consistently when starting AP
wifi: mwifiex: add extra delay for firmware ready
wifi: mac80211: sta_info.c: fix sentence grammar
wifi: mac80211: rx.c: fix sentence grammar
...
====================
Link: https://lore.kernel.org/r/20231218163900.C031DC433C9@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume says:
> I believe commit 5f7fc5d69f ("SUNRPC: Resupply rq_pages from
> node-local memory") in Linux 6.5+ is incorrect. It passes
> unconditionally rq_pool->sp_id as the NUMA node.
>
> While the comment in the svc_pool declaration in sunrpc/svc.h says
> that sp_id is also the NUMA node id, it might not be the case if
> the svc is created using svc_create_pooled(). svc_created_pooled()
> can use the per-cpu pool mode therefore in this case sp_id would
> be the cpu id.
Fix this by reverting now. At a later point this minor optimization,
and the deceptive labeling of the sp_id field, can be revisited.
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/linux-nfs/ZYC9rsno8qYggVt9@bender.morinfr.org/T/#u
Fixes: 5f7fc5d69f ("SUNRPC: Resupply rq_pages from node-local memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
With the introduction of the rcu_replace_pointer_rtnl helper,
cleanup the rtnl_unregister_* functions to use the helper instead
of open coding it.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
W=1 builds warn on missing MODULE_DESCRIPTION, add them here in MPTCP.
Only two were missing: two modules with different KUnit tests for MPTCP.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink PM can race with fastopen self-connect attempts, shutting
down the first subflow via:
MPTCP_PM_CMD_DEL_ADDR -> mptcp_nl_remove_id_zero_address ->
mptcp_pm_nl_rm_subflow_received -> mptcp_close_ssk
and transitioning such subflow to FIN_WAIT1 status before the syn-ack
packet is processed. The MPTCP code does not react to such state change,
leaving the connection in not-fallback status and the subflow handshake
uncompleted, triggering the following splat:
WARNING: CPU: 0 PID: 10630 at net/mptcp/subflow.c:1405 subflow_data_ready+0x39f/0x690 net/mptcp/subflow.c:1405
Modules linked in:
CPU: 0 PID: 10630 Comm: kworker/u4:11 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Workqueue: bat_events batadv_nc_worker
RIP: 0010:subflow_data_ready+0x39f/0x690 net/mptcp/subflow.c:1405
Code: 18 89 ee e8 e3 d2 21 f7 40 84 ed 75 1f e8 a9 d7 21 f7 44 89 fe bf 07 00 00 00 e8 0c d3 21 f7 41 83 ff 07 74 07 e8 91 d7 21 f7 <0f> 0b e8 8a d7 21 f7 48 89 df e8 d2 b2 ff ff 31 ff 89 c5 89 c6 e8
RSP: 0018:ffffc90000007448 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888031efc700 RCX: ffffffff8a65baf4
RDX: ffff888043222140 RSI: ffffffff8a65baff RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000007
R10: 000000000000000b R11: 0000000000000000 R12: 1ffff92000000e89
R13: ffff88807a534d80 R14: ffff888021c11a00 R15: 000000000000000b
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa19a0ffc81 CR3: 000000007a2db000 CR4: 00000000003506f0
DR0: 000000000000d8dd DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_data_ready+0x14c/0x5b0 net/ipv4/tcp_input.c:5128
tcp_data_queue+0x19c3/0x5190 net/ipv4/tcp_input.c:5208
tcp_rcv_state_process+0x11ef/0x4e10 net/ipv4/tcp_input.c:6844
tcp_v4_do_rcv+0x369/0xa10 net/ipv4/tcp_ipv4.c:1929
tcp_v4_rcv+0x3888/0x3b30 net/ipv4/tcp_ipv4.c:2329
ip_protocol_deliver_rcu+0x9f/0x480 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x2e4/0x510 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:314 [inline]
NF_HOOK include/linux/netfilter.h:308 [inline]
ip_local_deliver+0x1b6/0x550 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:461 [inline]
ip_rcv_finish+0x1c4/0x2e0 net/ipv4/ip_input.c:449
NF_HOOK include/linux/netfilter.h:314 [inline]
NF_HOOK include/linux/netfilter.h:308 [inline]
ip_rcv+0xce/0x440 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core+0x115/0x180 net/core/dev.c:5527
__netif_receive_skb+0x1f/0x1b0 net/core/dev.c:5641
process_backlog+0x101/0x6b0 net/core/dev.c:5969
__napi_poll.constprop.0+0xb4/0x540 net/core/dev.c:6531
napi_poll net/core/dev.c:6600 [inline]
net_rx_action+0x956/0xe90 net/core/dev.c:6733
__do_softirq+0x21a/0x968 kernel/softirq.c:553
do_softirq kernel/softirq.c:454 [inline]
do_softirq+0xaa/0xe0 kernel/softirq.c:441
</IRQ>
<TASK>
__local_bh_enable_ip+0xf8/0x120 kernel/softirq.c:381
spin_unlock_bh include/linux/spinlock.h:396 [inline]
batadv_nc_purge_paths+0x1ce/0x3c0 net/batman-adv/network-coding.c:471
batadv_nc_worker+0x9b1/0x10e0 net/batman-adv/network-coding.c:722
process_one_work+0x884/0x15c0 kernel/workqueue.c:2630
process_scheduled_works kernel/workqueue.c:2703 [inline]
worker_thread+0x8b9/0x1290 kernel/workqueue.c:2784
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
</TASK>
To address the issue, catch the racing subflow state change and
use it to cause the MPTCP fallback. Such fallback is also used to
cause the first subflow state propagation to the msk socket via
mptcp_set_connected(). After this change, the first subflow can
additionally propagate the TCP_FIN_WAIT1 state, so rename the
helper accordingly.
Finally, if the state propagation is delayed to the msk release
callback, the first subflow can change to a different state in between.
Cache the relevant target state in a new msk-level field and use
such value to update the msk state at release time.
Fixes: 1e777f39b4 ("mptcp: add MSG_FASTOPEN sendmsg flag support")
Cc: stable@vger.kernel.org
Reported-by: <syzbot+c53d4d3ddb327e80bc51@syzkaller.appspotmail.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/458
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to address the issues encountered with commit 1effe8ca4e
("skbuff: fix coalescing for page_pool fragment recycling"), the
combination of the following condition was excluded from skb coalescing:
from->pp_recycle = 1
from->cloned = 1
to->pp_recycle = 1
However, with page pool environments, the aforementioned combination can
be quite common(ex. NetworkMananger may lead to the additional
packet_type being registered, thus the cloning). In scenarios with a
higher number of small packets, it can significantly affect the success
rate of coalescing. For example, considering packets of 256 bytes size,
our comparison of coalescing success rate is as follows:
Without page pool: 70%
With page pool: 13%
Consequently, this has an impact on performance:
Without page pool: 2.57 Gbits/sec
With page pool: 2.26 Gbits/sec
Therefore, it seems worthwhile to optimize this scenario and enable
coalescing of this particular combination. To achieve this, we need to
ensure the correct increment of the "from" SKB page's page pool
reference count (pp_ref_count).
Following this optimization, the success rate of coalescing measured in
our environment has improved as follows:
With page pool: 60%
This success rate is approaching the rate achieved without using page
pool, and the performance has also been improved:
With page pool: 2.52 Gbits/sec
Below is the performance comparison for small packets before and after
this optimization. We observe no impact to packets larger than 4K.
packet size before after improved
(bytes) (Gbits/sec) (Gbits/sec)
128 1.19 1.27 7.13%
256 2.26 2.52 11.75%
512 4.13 4.81 16.50%
1024 6.17 6.73 9.05%
2048 14.54 15.47 6.45%
4096 25.44 27.87 9.52%
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap code for checking if a page is a page_pool page into a
function for better readability and ease of reuse.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almarsymina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up to now, we were only subtracting from the number of used page fragments
to figure out when a page could be freed or recycled. A following patch
introduces support for multiple users referencing the same fragment. So
reduce the initial page fragments value to half to avoid overflowing.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Mina Almasry <almarsymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1580ab63fc ("tcp/dccp: better use of ephemeral ports in connect()")
we added an heuristic to select even ports for connect() and odd ports for bind().
This was nice because no applications changes were needed.
But it added more costs when all even ports are in use,
when there are few listeners and many active connections.
Since then, IP_LOCAL_PORT_RANGE has been added to permit an application
to partition ephemeral port range at will.
This patch extends the idea so that if IP_LOCAL_PORT_RANGE is set on
a socket before accept(), port selection no longer favors even ports.
This means that connect() can find a suitable source port faster,
and applications can use a different split between connect() and bind()
users.
This should give more entropy to Toeplitz hash used in RSS: Using even
ports was wasting one bit from the 16bit sport.
A similar change can be done in inet_csk_find_open_port() if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change inet_sk_get_local_port_range() to return a boolean,
telling the callers if the port range was provided by
IP_LOCAL_PORT_RANGE socket option.
Adds documentation while we are at it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20231214192939.1962891-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure the various dtor functions match their prototype and retain
their CFI signatures, since they don't have their address taken, they
are prone to not getting CFI, making them impossible to call
indirectly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These tramplines much have
matching CFI.
In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions, the compiler will
generate correct CFI which we can pilfer for the trampolines.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This can cause a race with bt_sock_ioctl() because
bt_sock_recvmsg() gets the skb from sk->sk_receive_queue
and then frees it without holding lock_sock.
A use-after-free for a skb occurs with the following flow.
```
bt_sock_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
bt_sock_ioctl() -> skb_peek()
```
Add lock_sock to bt_sock_recvmsg() to fix this issue.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If two Bluetooth devices both support BR/EDR and BLE, and also
support Secure Connections, then they only need to pair once.
The LTK generated during the LE pairing process may be converted
into a BR/EDR link key for BR/EDR transport, and conversely, a
link key generated during the BR/EDR SSP pairing process can be
converted into an LTK for LE transport. Hence, the link type of
the link key and LTK is not fixed, they can be either an LE LINK
or an ACL LINK.
Currently, in the mgmt_new_irk/ltk/crsk/link_key functions, the
link type is fixed, which could lead to incorrect address types
being reported to the application layer. Therefore, it is necessary
to add link_type/addr_type to the smp_irk/ltk/crsk and link_key,
to ensure the generation of the correct address type.
SMP over BREDR:
Before Fix:
> ACL Data RX: Handle 11 flags 0x02 dlen 12
BR/EDR SMP: Identity Address Information (0x09) len 7
Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
Random address: 00:00:00:00:00:00 (Non-Resolvable)
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated key from P-256 (0x03)
After Fix:
> ACL Data RX: Handle 11 flags 0x02 dlen 12
BR/EDR SMP: Identity Address Information (0x09) len 7
Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
Random address: 00:00:00:00:00:00 (Non-Resolvable)
BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated key from P-256 (0x03)
SMP over LE:
Before Fix:
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
Random address: 5F:5C:07:37:47:D5 (Resolvable)
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated key from P-256 (0x03)
@ MGMT Event: New Link Key (0x0009) plen 26
BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated Combination key from P-256 (0x08)
After Fix:
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
Random address: 5E:03:1C:00:38:21 (Resolvable)
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated key from P-256 (0x03)
@ MGMT Event: New Link Key (0x0009) plen 26
Store hint: Yes (0x01)
LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
Key type: Authenticated Combination key from P-256 (0x08)
Cc: stable@vger.kernel.org
Signed-off-by: Xiao Yao <xiaoyao@rock-chips.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
L2CAP/COS/CED/BI-02-C PTS test send a malformed L2CAP signaling packet
with 2 commands in it (a connection request and an unknown command) and
expect to get a connection response packet and a command reject packet.
The second is currently not sent.
Cc: stable@vger.kernel.org
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Turning on -Wstringop-overflow globally exposed a misleading compiler
warning in bluetooth:
net/bluetooth/hci_event.c: In function 'hci_cc_read_class_of_dev':
net/bluetooth/hci_event.c:524:9: error: 'memcpy' writing 3 bytes into a
region of size 0 overflows the destination [-Werror=stringop-overflow=]
524 | memcpy(hdev->dev_class, rp->dev_class, 3);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The problem here is the check for hdev being NULL in bt_dev_dbg() that
leads the compiler to conclude that hdev->dev_class might be an invalid
pointer access.
Add another explicit check for the same condition to make sure gcc sees
this cannot happen.
Fixes: a9de924806 ("[Bluetooth] Switch from OGF+OCF to using only opcodes")
Fixes: 1b56c90018f0 ("Makefile: Enable -Wstringop-overflow globally")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Before setting HCI_INQUIRY bit check if HCI_OP_INQUIRY was really sent
otherwise the controller maybe be generating invalid events or, more
likely, it is a result of fuzzing tools attempting to test the right
behavior of the stack when unexpected events are generated.
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218151
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Some layers such as SMP depend on getting notified about encryption
changes immediately as they only allow certain PDU to be transmitted
over an encrypted link which may cause SMP implementation to reject
valid PDUs received thus causing pairing to fail when it shouldn't.
Fixes: 7aca0ac479 ("Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to complete")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
We assume in handful of places that the name of the spec is
the same as the name of the family. We could fix that but
it seems like a fair assumption to make. Rename the MPTCP
spec instead.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
optmem_max being used in tx zerocopy,
we want to be able to control it on a netns basis.
Following patch changes two tests.
Tested:
oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# echo 1000000 >/proc/sys/net/core/optmem_max
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000
oqq130:~# unshare -n
oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# exit
logout
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For many years, /proc/sys/net/core/optmem_max default value
on a 64bit kernel has been 20 KB.
Regular usage of TCP tx zerocopy needs a bit more.
Google has used 128KB as the default value for 7 years without
any problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following NULL pointer dereference issue occurred:
BUG: kernel NULL pointer dereference, address: 0000000000000000
<...>
RIP: 0010:ccid_hc_tx_send_packet net/dccp/ccid.h:166 [inline]
RIP: 0010:dccp_write_xmit+0x49/0x140 net/dccp/output.c:356
<...>
Call Trace:
<TASK>
dccp_sendmsg+0x642/0x7e0 net/dccp/proto.c:801
inet_sendmsg+0x63/0x90 net/ipv4/af_inet.c:846
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x83/0xe0 net/socket.c:745
____sys_sendmsg+0x443/0x510 net/socket.c:2558
___sys_sendmsg+0xe5/0x150 net/socket.c:2612
__sys_sendmsg+0xa6/0x120 net/socket.c:2641
__do_sys_sendmsg net/socket.c:2650 [inline]
__se_sys_sendmsg net/socket.c:2648 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2648
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x43/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
sk_wait_event() returns an error (-EPIPE) if disconnect() is called on the
socket waiting for the event. However, sk_stream_wait_connect() returns
success, i.e. zero, even if sk_wait_event() returns -EPIPE, so a function
that waits for a connection with sk_stream_wait_connect() may misbehave.
In the case of the above DCCP issue, dccp_sendmsg() is waiting for the
connection. If disconnect() is called in concurrently, the above issue
occurs.
This patch fixes the issue by returning error from sk_stream_wait_connect()
if sk_wait_event() fails.
Fixes: 419ce133ab ("tcp: allow again tcp_disconnect() when threads are waiting")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reported-by: syzbot+c71bc336c5061153b502@syzkaller.appspotmail.com
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send credit update message when SO_RCVLOWAT is updated and it is bigger
than number of bytes in rx queue. It is needed, because 'poll()' will
wait until number of bytes in rx queue will be not smaller than
O_RCVLOWAT, so kick sender to send more data. Otherwise mutual hungup
for tx/rx is possible: sender waits for free space and receiver is
waiting data in 'poll()'.
Rename 'set_rcvlowat' callback to 'notify_set_rcvlowat' and set
'sk->sk_rcvlowat' only in one place (i.e. 'vsock_set_rcvlowat'), so the
transport doesn't need to do it.
Fixes: b89d882dc9 ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add one more condition for sending credit update during dequeue from
stream socket: when number of bytes in the rx queue is smaller than
SO_RCVLOWAT value of the socket. This is actual for non-default value
of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
transmission, because we need at least SO_RCVLOWAT bytes in our rx
queue to wake up user for reading data (in corner case it is also
possible to stuck both tx and rx sides, this is why 'Fixes' is used).
Fixes: b89d882dc9 ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support IP_PKTINFO on those packets, we need to call
ipv4_pktinfo_prepare.
When sending mrouted/pimd daemons a cache report IGMP msg, it is
unnecessary to set dst on the newly created skb.
It used to be necessary on older versions until
commit d826eb14ec ("ipv4: PKTINFO doesnt need dst reference") which
changed the way IP_PKTINFO struct is been retrieved.
Changes from v1:
1. Undo changes in ipv4_pktinfo_prepare function. use it directly
and copy the control block.
Fixes: d826eb14ec ("ipv4: PKTINFO doesnt need dst reference")
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While disassociating from a PAN ourselves, let's set the maximum number
of associations temporarily to zero to be sure no new device tries to
associate with us.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20231128111655.507479-6-miquel.raynal@bootlin.com
Once associated with any device, we are part of a PAN (with a specific
PAN ID), and we are expected to be present on a particular
channel. Let's avoid confusing other devices by preventing any PAN
ID/channel change once associated.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20231128111655.507479-5-miquel.raynal@bootlin.com
It is not very clear in the specification whether simple coordinators
are allowed or not to answer to association requests themselves. As
there is no synchronization mechanism, it is probably best to rely on
the relay feature of these coordinators and avoid processing them in
this case.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20231128111655.507479-4-miquel.raynal@bootlin.com
ACKs come with the source and destination address empty, this has been
clarified already. But there is something else: if the destination
address is empty but the source address is valid, it may be a way to
reach the PAN coordinator. Either the device receiving this frame is the
PAN coordinator itself and should process what it just received
(PACKET_HOST) or it is not and may, if supported, relay the packet as it
is targeted to another device in the network.
Right now we do not support relaying so the packet should be dropped in
the first place, but the stamping looks more accurate this way.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20231128111655.507479-3-miquel.raynal@bootlin.com
Sending a beacon is a way to advertise a PAN, but also ourselves as
coordinator in the PAN. There is only one PAN coordinator in a PAN, this
is the device without parent (not associated itself to an "upper"
coordinator). Instead of blindly saying that we are the PAN coordinator,
let's actually use our internal information to fill this field.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20231128111655.507479-2-miquel.raynal@bootlin.com
This commit adds an unstable kfunc helper to access internal xfrm_state
associated with an SA. This is intended to be used for the upcoming
IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
words: for custom software RSS.
That being said, the function that this kfunc wraps is fairly generic
and used for a lot of xfrm tasks. I'm sure people will find uses
elsewhere over time.
This commit also adds a corresponding bpf_xdp_xfrm_state_release() kfunc
to release the refcnt acquired by bpf_xdp_get_xfrm_state(). The verifier
will require that all acquired xfrm_state's are released.
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/a29699c42f5fad456b875c98dd11c6afc3ffb707.1702593901.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Correct run-on sentences by changing "," to ";".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Link: https://msgid.link/20231213054809.23475-1-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Correct a run-on sentence by changing "," to ";".
Add a subject in one sentence.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Link: https://msgid.link/20231213054800.22561-1-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The build can become unreproducible if the list of files
found by $(wildcard ...) differs. Sort the list to avoid
this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Because atalk_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with atalk_recvmsg().
A use-after-free for skb occurs with the following flow.
```
atalk_ioctl() -> skb_peek()
atalk_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
```
Add sk->sk_receive_queue.lock to atalk_ioctl() to fix this issue.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231213041056.GA519680@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The file for the new certificate (Chen-Yu Tsai's) didn't
end with a comma, so depending on the file order in the
build rule, we'd end up with invalid C when concatenating
the (now two) certificates. Fix that.
Cc: stable@vger.kernel.org
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: fb768d3b13 ("wifi: cfg80211: Add my certificate")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Symmetric RSS hash functions are beneficial in applications that monitor
both Tx and Rx packets of the same flow (IDS, software firewalls, ..etc).
Getting all traffic of the same flow on the same RX queue results in
higher CPU cache efficiency.
A NIC that supports "symmetric-xor" can achieve this RSS hash symmetry
by XORing the source and destination fields and pass the values to the
RSS hash algorithm.
The user may request RSS hash symmetry for a specific algorithm, via:
# ethtool -X eth0 hfunc <hash_alg> symmetric-xor
or turn symmetry off (asymmetric) by:
# ethtool -X eth0 hfunc <hash_alg>
The specific fields for each flow type should then be specified as usual
via:
# ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-4-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the RSS context parameters to struct ethtool_rxfh_param and use the
get/set_rxfh to handle the RSS contexts as well.
This is part 2/2 of the fix suggested in [1]:
- Add a rss_context member to the argument struct and a capability
like cap_link_lanes_supported to indicate whether driver supports
rss contexts, then you can remove *et_rxfh_context functions,
and instead call *et_rxfh() with a non-zero rss_context.
Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/ [1]
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Tony Nguyen <anthony.l.nguyen@intel.com>
CC: Marcin Wojtas <mw@semihalf.com>
CC: Russell King <linux@armlinux.org.uk>
CC: Sunil Goutham <sgoutham@marvell.com>
CC: Geetha sowjanya <gakula@marvell.com>
CC: Subbaraya Sundeep <sbhatta@marvell.com>
CC: hariprasad <hkelam@marvell.com>
CC: Saeed Mahameed <saeedm@nvidia.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Edward Cree <ecree.xilinx@gmail.com>
CC: Martin Habets <habetsm.xilinx@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-3-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The get/set_rxfh ethtool ops currently takes the rxfh (RSS) parameters
as direct function arguments. This will force us to change the API (and
all drivers' functions) every time some new parameters are added.
This is part 1/2 of the fix, as suggested in [1]:
- First simplify the code by always providing a pointer to all params
(indir, key and func); the fact that some of them may be NULL seems
like a weird historic thing or a premature optimization.
It will simplify the drivers if all pointers are always present.
- Then make the functions take a dev pointer, and a pointer to a
single struct wrapping all arguments. The set_* should also take
an extack.
Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/ [1]
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Releasing the DMA mapping will be useful for other types
of pages, so factor it out. Make sure compiler inlines it,
to avoid any regressions.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To support multiple users referencing the same fragment,
'pp_frag_count' is renamed to 'pp_ref_count', transitioning pp pages
from fragment management to reference count management after draining
based on the suggestion from [1].
The idea is that the concept of fragmenting exists before the page is
drained, and all related functions retain their current names.
However, once the page is drained, its management shifts to being
governed by 'pp_ref_count'. Therefore, all functions associated with
that lifecycle stage of a pp page are renamed.
[1]
http://lore.kernel.org/netdev/f71d9448-70c8-8793-dc9a-0eb48a570300@huawei.com
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231212044614.42733-2-liangchen.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For some reason sctp_poll() generates EPOLLERR if sk->sk_error_queue
is not empty but recvmsg() can not drain the error queue yet.
This is needed to better support timestamping.
I had to export inet_recv_error(), since sctp
can be compiled as a module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20231212145550.3872051-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Follow up commit 9690ae6042 ("ethtool: add header/data split
indication") and add the set part of Ethtool's header split, i.e.
ability to enable/disable header split via the Ethtool Netlink
interface. This might be helpful to optimize the setup for particular
workloads, for example, to avoid XDP frags, and so on.
A driver should advertise ``ETHTOOL_RING_USE_TCP_DATA_SPLIT`` in its
ops->supported_ring_params to allow doing that. "Unknown" passed from
the userspace when the header split is supported means the driver is
free to choose the preferred state.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231212142752.935000-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to do signed arithmetic if we expect condition
`if (bytes < 0)` to be possible
Found by Linux Verification Center (linuxtesting.org) with SVACE
Fixes: 06a8fc7836 ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20231211162317.4116625-1-kniv@yandex-team.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcf_idr_insert_many will replace the allocated -EBUSY pointer in
tcf_idr_check_alloc with the real action pointer, exposing it
to all operations. This operation is only needed when the action pointer
is created (ACT_P_CREATED). For actions which are bound to (returned 0),
the pointer already resides in the idr making such operation a nop.
Even though it's a nop, it's still not a cheap operation as internally
the idr code walks the idr and then does a replace on the appropriate slot.
So if the action was bound, better skip the idr replace entirely.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20231211181807.96028-3-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of relying only on the idrinfo->lock mutex for
bind/alloc logic, rely on a combination of rcu + mutex + atomics
to better scale the case where multiple rtnl-less filters are
binding to the same action object.
Action binding happens when an action index is specified explicitly and
an action exists which such index exists. Example:
tc actions add action drop index 1
tc filter add ... matchall action drop index 1
tc filter add ... matchall action drop index 1
tc filter add ... matchall action drop index 1
tc filter ls ...
filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 4 bind 3
filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 4 bind 3
filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 4 bind 3
When no index is specified, as before, grab the mutex and allocate
in the idr the next available id. In this version, as opposed to before,
it's simplified to store the -EBUSY pointer instead of the previous
alloc + replace combination.
When an index is specified, rely on rcu to find if there's an object in
such index. If there's none, fallback to the above, serializing on the
mutex and reserving the specified id. If there's one, it can be an -EBUSY
pointer, in which case we just try again until it's an action, or an action.
Given the rcu guarantees, the action found could be dead and therefore
we need to bump the refcount if it's not 0, handling the case it's
in fact 0.
As bind and the action refcount are already atomics, these increments can
happen without the mutex protection while many tcf_idr_check_alloc race
to bind to the same action instance.
In case binding encounters a parallel delete or add, it will return
-EAGAIN in order to try again. Both filter and action apis already
have the retry machinery in-place. In case it's an unlocked filter it
retries under the rtnl lock.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20231211181807.96028-2-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I added logic to track the sock pair for stream_unix sockets so that we
ensure lifetime of the sock matches the time a sockmap could reference
the sock (see fixes tag). I forgot though that we allow af_unix unconnected
sockets into a sock{map|hash} map.
This is problematic because previous fixed expected sk_pair() to exist
and did not NULL check it. Because unconnected sockets have a NULL
sk_pair this resulted in the NULL ptr dereference found by syzkaller.
BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
Write of size 4 at addr 0000000000000080 by task syz-executor360/5073
Call Trace:
<TASK>
...
sock_hold include/net/sock.h:777 [inline]
unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
sock_map_init_proto net/core/sock_map.c:190 [inline]
sock_map_link+0xb87/0x1100 net/core/sock_map.c:294
sock_map_update_common+0xf6/0x870 net/core/sock_map.c:483
sock_map_update_elem_sys+0x5b6/0x640 net/core/sock_map.c:577
bpf_map_update_value+0x3af/0x820 kernel/bpf/syscall.c:167
We considered just checking for the null ptr and skipping taking a ref
on the NULL peer sock. But, if the socket is then connected() after
being added to the sockmap we can cause the original issue again. So
instead this patch blocks adding af_unix sockets that are not in the
ESTABLISHED state.
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+e8030702aefd3444fb9e@syzkaller.appspotmail.com
Fixes: 8866730aed ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231201180139.328529-2-john.fastabend@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Implement functionality that enables drivers to expose VLAN tag
to XDP code.
VLAN tag is represented by 2 variables:
- protocol ID, which is passed to bpf code in BE
- VLAN TCI, in host byte order
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20231205210847.28460-10-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 94ecc5ca4d ("xsk: Add cb area to struct xdp_buff_xsk") has added
a buffer for custom data to xdp_buff_xsk. Particularly, this memory is used
for data, consumed by XDP hints kfuncs. It does not always change on
a per-packet basis and some parts can be set for example, at the same time
as RX queue info.
Add functions to fill all cbs in xsk_buff_pool with the same metadata.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20231205210847.28460-8-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
and packets are sent as soon as possible. However in the `tcp_push` function
where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
considered which can trigger unexpected corking of packets and induce delays.
For example, if two packets are generated as part of a server's reply, if the
first one is not transmitted on the wire quickly enough, the second packet can
trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
possible. It will either wait for additional packets to be coalesced or an ACK
from the client before transmitting the corked packet. This can interact badly
if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
completion times. It is not always possible to control who has delayed acks
set, but it is possible to adjust when and how autocorking is triggered.
Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class HelloWorldServlet extends HttpServlet {
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html;charset=utf-8");
OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
String s = "a".repeat(3096);
osw.write(s,0,s.length());
osw.flush();
}
}
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. With the current auto-corking behavior and TCP_NODELAY
set an additional 40ms latency from P99.99+ values are observed. With the
patch applied we see no occurrences of 40ms latencies. The patch has also been
tested with iperf and uperf benchmarks and no regression was observed.
# No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello'
...
50.000% 0.91ms
75.000% 1.12ms
90.000% 1.46ms
99.000% 1.73ms
99.900% 1.96ms
99.990% 43.62ms <<< 40+ ms extra latency
99.999% 48.32ms
100.000% 49.34ms
# With patch
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello'
...
50.000% 0.89ms
75.000% 1.13ms
90.000% 1.44ms
99.000% 1.67ms
99.900% 1.78ms
99.990% 2.27ms <<< no 40+ ms extra latency
99.999% 3.71ms
100.000% 4.57ms
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller report:
kernel BUG at net/core/skbuff.c:3452!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc4-00009-gbee0e7762ad2-dirty #135
RIP: 0010:skb_copy_and_csum_bits (net/core/skbuff.c:3452)
Call Trace:
icmp_glue_bits (net/ipv4/icmp.c:357)
__ip_append_data.isra.0 (net/ipv4/ip_output.c:1165)
ip_append_data (net/ipv4/ip_output.c:1362 net/ipv4/ip_output.c:1341)
icmp_push_reply (net/ipv4/icmp.c:370)
__icmp_send (./include/net/route.h:252 net/ipv4/icmp.c:772)
ip_fragment.constprop.0 (./include/linux/skbuff.h:1234 net/ipv4/ip_output.c:592 net/ipv4/ip_output.c:577)
__ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:295)
ip_output (net/ipv4/ip_output.c:427)
__ip_queue_xmit (net/ipv4/ip_output.c:535)
__tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
__tcp_retransmit_skb (net/ipv4/tcp_output.c:3387)
tcp_retransmit_skb (net/ipv4/tcp_output.c:3404)
tcp_retransmit_timer (net/ipv4/tcp_timer.c:604)
tcp_write_timer (./include/linux/spinlock.h:391 net/ipv4/tcp_timer.c:716)
The panic issue was trigered by tcp simultaneous initiation.
The initiation process is as follows:
TCP A TCP B
1. CLOSED CLOSED
2. SYN-SENT --> <SEQ=100><CTL=SYN> ...
3. SYN-RECEIVED <-- <SEQ=300><CTL=SYN> <-- SYN-SENT
4. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED
5. SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
// TCP B: not send challenge ack for ack limit or packet loss
// TCP A: close
tcp_close
tcp_send_fin
if (!tskb && tcp_under_memory_pressure(sk))
tskb = skb_rb_last(&sk->tcp_rtx_queue); //pick SYN_ACK packet
TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN; // set FIN flag
6. FIN_WAIT_1 --> <SEQ=100><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ...
// TCP B: send challenge ack to SYN_FIN_ACK
7. ... <SEQ=301><ACK=101><CTL=ACK> <-- SYN-RECEIVED //challenge ack
// TCP A: <SND.UNA=101>
8. FIN_WAIT_1 --> <SEQ=101><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ... // retransmit panic
__tcp_retransmit_skb //skb->len=0
tcp_trim_head
len = tp->snd_una - TCP_SKB_CB(skb)->seq // len=101-100
__pskb_trim_head
skb->data_len -= len // skb->len=-1, wrap around
... ...
ip_fragment
icmp_glue_bits //BUG_ON
If we use tcp_trim_head() to remove acked SYN from packet that contains data
or other flags, skb->len will be incorrectly decremented. We can remove SYN
flag that has been acked from rtx_queue earlier than tcp_trim_head(), which
can fix the problem mentioned above.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Link: https://lore.kernel.org/r/20231210020200.1539875-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If some of p9pdu_readf() calls inside case 'T' in p9pdu_vreadf() fails,
the error path is not handled properly. *wnames or members of *wnames
array may be left uninitialized and invalidly freed.
Initialize *wnames to NULL in beginning of case 'T'. Initialize the first
*wnames array element to NULL and nullify the failing *wnames element so
that the error path freeing loop stops on the first NULL element and
doesn't proceed further.
Found by Linux Verification Center (linuxtesting.org).
Fixes: ace51c4dd2 ("9p: add new protocol support code")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Message-ID: <20231206200913.16135-1-pchelkin@ispras.ru>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Merge vfs.file from the VFS tree to avoid conflicts with receive_fd() now
having 3 arguments rather than just 2.
* 'vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
Not every subsystem needs to have their own specialized helper.
Just us the __receive_fd() helper.
Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-4-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Because rose_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with rose_accept().
A use-after-free for skb occurs with the following flow.
```
rose_ioctl() -> skb_peek()
rose_accept() -> skb_dequeue() -> kfree_skb()
```
Add sk->sk_receive_queue.lock to rose_ioctl() to fix this issue.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231209100538.GA407321@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Because do_vcc_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with vcc_recvmsg().
A use-after-free for skb occurs with the following flow.
```
do_vcc_ioctl() -> skb_peek()
vcc_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
```
Add sk->sk_receive_queue.lock to do_vcc_ioctl() to fix this issue.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231209094210.GA403126@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ESS capability bit is reserved in frames transmitted by
the client, so we shouldn't set it. Since we've set it for
decades, keep that old behaviour unless we're connection to
a new EHT AP.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.65005aba900b.I3d00c8741400572a89a7508b5ae612c968874ad7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When doing a channel switch, cfg80211_update_known_bss may be called
with a BSS where both proberesp_ies and beacon_ies is set. If that
happens, both need to be consumed.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.07a88656d7df.I0fe9fc599382de0eccf96455617e377d9c231966@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The specification says that this information should not be explicitly
included in the per-STA profile. However, we need this information
readily available in the BSS for userspace and also internally when
associating. As such, append the appropriate element before
adding/updating the BSS.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.abde63d9cc6d.I3d346be0f84f51dccf4f4f92a3e997e6102b9456@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ENOTSUPP isn't a standard error code, don't use it.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.0214b6c79756.I2536bc8426ae15c8cff7ad199e57f06e2e404f13@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ENOTSUP isn't a standard error code. EOPNOTSUPP should be used instead.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.3841b71c867d.Idf2ad01d9dfe8d6d6c352bf02deb06e49701ad1d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There may be cases where puncturing isn't possible, and
a connection needs to be downgraded. Add a hardware flag
to support this.
This is likely temporary: it seems we will need to move
puncturing to the chandef/channel context.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.c1e89ea55e93.I37b8ca0ee64d5d7699e351785a9010afc106da3c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for setting the TID to link mapping for a non-AP MLD
station.
This is useful in cases user space needs to restrict the possible
set of active links, e.g., since it got a BSS Transition Management
request forcing to use only a subset of the valid links etc.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.da4d56a5f3ff.Iacf88e943326bf9c169c49b728c4a3445fdedc97@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes there may be reasons for which a BSS that's
actually found in scan cannot be used to connect to,
for example a nonprimary link of an NSTR mobile AP MLD
cannot be used for normal direct connections to it.
Not indicating these to userspace as we do now of course
avoids being able to connect to them, but it's better if
they're shown to userspace and it can make an appropriate
decision, without e.g. doing an additional ML probe.
Thus add an indication of what a BSS can be used for,
currently "normal" and "MLD link", including a reason
bitmap for it being not usable.
The latter can be extended later for certain BSSes if there
are other reasons they cannot be used.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.0464f25e0b1d.I9f70ca9f1440565ad9a5207d0f4d00a20cca67e7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current handling of del pmksa with SSID is limited to FILS
security. In the current change the del pmksa support is extended
to SAE/OWE security offloads as well. For OWE/SAE offloads, the
PMK is generated and cached at driver/FW, so user app needs the
capability to request cache deletion based on SSID for drivers
supporting SAE/OWE offload.
Signed-off-by: Vinayak Yadawad <vinayak.yadawad@broadcom.com>
Link: https://msgid.link/ecdae726459e0944c377a6a6f6cb2c34d2e057d0.1701262123.git.vinayak.yadawad@broadcom.com
[drop whitespace-damaged rdev_ops pointer completely, enabling tracing]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prefer native jiffies-wide 'unsigned long' for the 'last_active' field of
'struct airtime_info' and introduce 'ieee80211_sta_keep_active()' for airtime
check in 'ieee80211_txq_keep_active()' and 'ieee80211_sta_register_airtime()'.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://msgid.link/20231206060935.612241-1-dmantipov@yandex.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
To support the WBRF mechanism, Wifi adapters utilized in the system must
register the frequencies in use (or unregister those frequencies no longer
used) via the dedicated calls. So that, other drivers responding to the
frequencies can take proper actions to mitigate possible interference.
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Co-developed-by: Evan Quan <quanliangl@hotmail.com>
Signed-off-by: Evan Quan <quanliangl@hotmail.com>
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Link: https://msgid.link/20231211100630.2170152-5-Jun.Ma2@amd.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The newly added WBRF feature needs this interface for channel
width calculation.
Signed-off-by: Evan Quan <quanliangl@hotmail.com>
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://msgid.link/20231211100630.2170152-4-Jun.Ma2@amd.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee802_11_parse_elems() can return NULL, so we must
check for the return value.
Fixes: 5d24828d05 ("mac80211: always allocate struct ieee802_11_elems")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.93dea364f3d3.Ie87781c6c48979fb25a744b90af4a33dc2d83a28@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need to check that cfg80211_defragment_element()
didn't return an error, since it can fail due to bad
input, and we didn't catch that before.
Fixes: 8eb8dd2ffb ("wifi: mac80211: Support link removal using Reconfiguration ML element")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.8595a6b67fc0.I1225edd8f98355e007f96502e358e476c7971d8c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we're doing reconfig, then we cannot add the debugfs
files that are already there from before the reconfig.
Skip that in drv_change_sta_links() during reconfig.
Fixes: d2caad527c ("wifi: mac80211: add API to show the link STAs in debugfs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20231211085121.88a950f43e16.Id71181780994649219685887c0fcad33d387cc78@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix the undefined usage of the GPIO consumer API after retrieving the
GPIO description with GPIO_ASIS. The API documentation mentions that
GPIO_ASIS won't set a GPIO direction and requires the user to set a
direction before using the GPIO.
This can be confirmed on i.MX6 hardware, where rfkill-gpio is no longer
able to enabled/disable a device, presumably because the GPIO controller
was never configured for the output direction.
Fixes: b2f750c3a8 ("net: rfkill: gpio: prevent value glitch during probe")
Cc: stable@vger.kernel.org
Signed-off-by: Rouven Czerwinski <r.czerwinski@pengutronix.de>
Link: https://msgid.link/20231207075835.3091694-1-r.czerwinski@pengutronix.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
As of today tc-filter/chain events are unconditionally built and sent to
RTNLGRP_TC. As with the introduction of rtnl_notify_needed we can check
before-hand if they are really needed. This will help to alleviate
system pressure when filters are concurrently added without the rtnl
lock as in tc-flower.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://lore.kernel.org/r/20231208192847.714940-8-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This argument is never called while set to true, so remove it as there's
no need for it.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231208192847.714940-7-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of today tc-action events are unconditionally built and sent to
RTNLGRP_TC. As with the introduction of rtnl_notify_needed we can check
before-hand if they are really needed.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231208192847.714940-6-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use max() in a couple of places that are open coding it with the
ternary operator.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231208192847.714940-5-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
np->ucast_oif is read locklessly in some contexts.
Make all accesses to this field lockless, adding appropriate
annotations.
This also makes setsockopt( IPV6_UNICAST_IF ) lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
np->mcast_oif is read locklessly in some contexts.
Make all accesses to this field lockless, adding appropriate
annotations.
This also makes setsockopt( IPV6_MULTICAST_IF ) lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit b8dbbbc535 ("net: rtnetlink: remove local list
in __linkwatch_run_queue()"). It's evidently broken when there's a
non-urgent work that gets added back, and then the loop can never
finish.
While reverting, add a note about that.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: b8dbbbc535 ("net: rtnetlink: remove local list in __linkwatch_run_queue()")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files via sockets via SCM_RIGHT, so there
are no possible cycles invloving registered files and thus rendering
SCM accounting on the io_uring side unnecessary.
Cc: stable@vger.kernel.org
Fixes: 0091bfc817 ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 939463016b ("tcp: change data receiver flowlabel after one dup")
we noticed an increase of TCPACKSkippedPAWS events.
Neal Cardwell tracked the issue to tcp_disordered_ack() assumption
about remote peer TS clock.
RFC 1323 & 7323 are suggesting the following:
"timestamp clock frequency in the range 1 ms to 1 sec per tick
between 1ms and 1sec."
This has to be adjusted for 1 MHz clock frequency.
This hints at reorders of SACK packets on send side,
this might deserve a future patch.
(skb->ooo_okay is always set for pure ACK packets)
Fixes: 614e8316aa ("tcp: add support for usec resolution in TCP TS values")
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Morley <morleyd@google.com>
Link: https://lore.kernel.org/r/20231207181342.525181-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
My previous patch added a call to linkwatch_sync_dev(),
but that of course needs to be called under RTNL, which
I missed earlier, but now saw RCU warnings from.
Fix that by acquiring the RTNL in a similar fashion to
how other files do it here.
Fixes: facd15dfd6 ("net: core: synchronize link-watch when carrier is queried")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20231206172122.859df6ba937f.I9c80608bcfbab171943ff4942b52dbd5e97fe06e@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVzOJkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmYVD/0cx0TTIrip959DLqF/V8sl2BIrt/mjAvS4
oeVUe5OmqyR2gjjEYewf21MUyzE4tSMO/LFTEr0744zENKNTL84YhIIq30ga+Gue
n61c4WfPnhpvj8NQHuEf65cPosPSvKi6NSMLRJZCLqHtn8SrTQyCg8zk8GwjN/nl
fJScbrj4XKfZNlizKfbHQexfi78DZ2braTds0pPZ+uFXDTIrOKAfixfV39qwTFYZ
zI4FYKH8KzZzuMyyu2B+F3xCMdelUg26i2KMImKBaOsamnucIlyNvr/uWGs2G8tu
Z7sWGXdY9bFlWfAFxGZeFRWbmqpFz15Mmi2Uqx8wiiYxBAaJKL+Qaq358KbTD0hB
ZBKdy3AUw5J/445pwIepGp5XVxqn/qJFxGXzLAlncdhf9mXrjmFwNC/Yp5lnyDYy
S3YhUsjpGX3Mymjd/gWkn1BTZh7zzpKI6LmWJjn89jmTpOzlWmfPu/uM/c/vKvE8
KajCkZ3nUCmr56GUxvSZcon7vwc8pLUyrF8Vo1vwEEVgiN+IjJVk3dMAz0hyGhtO
2HxSwOAHllAIyqjmazqESQnWEf1p8idnoR9qZXAiLzbwUFbUY/a/YrCul6vHM4yE
czat+EGWdfJi0EX0z/bMUVRz05UbNt0JtKf3BnqxWtQlT8yKwCvMgHXuPJbY4y5g
yXi7ep37JQ==
=Xta7
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.7-2023-12-08' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Two minor fixes for issues introduced in this release cycle, and two
fixes for issues or potential issues that are heading to stable.
One of these ends up disabling passing io_uring file descriptors via
SCM_RIGHTS. There really shouldn't be an overlap between that kind of
historic use case and modern usage of io_uring, which is why this was
deemed appropriate"
* tag 'io_uring-6.7-2023-12-08' of git://git.kernel.dk/linux:
io_uring/af_unix: disable sending io_uring over sockets
io_uring/kbuf: check for buffer list readiness after NULL check
io_uring/kbuf: Fix an NULL vs IS_ERR() bug in io_alloc_pbuf_ring()
io_uring: fix mutex_unlock with unreferenced ctx
Commit 227b60f510 added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit port numbers can be held in
a u32 and read/written in a single instruction.
More recently 91d0b78c51 added support for finer per-socket limits.
The user-supplied value is 'high << 16 | low' but they are held
separately and the socket options protected by the socket lock.
Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk'
fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
always updated together.
Change (the now trival) inet_get_local_port_range() to a static inline
to optimise the calling code.
(In particular avoiding returning integers by reference.)
Signed-off-by: David Laight <david.laight@aculab.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/4e505d4198e946a8be03fb1b4c3072b0@AcuMS.aculab.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tty_operations::send_xchar is one of the last users of 'char' type for
characters in the tty layer. Convert it to u8 now.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: netdev@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20231206073712.17776-5-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use strscpy() to implement ethtool_puts().
Functionally the same as ethtool_sprintf() when it's used with two
arguments or with just "%s" format specifier.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Madhuri Sripada <madhuri.sripada@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo points out that we effectively clear all unknown
flags from PIO when copying them to userspace in the netlink
RTM_NEWPREFIX notification.
We could fix this one at a time as new flags are defined,
or in one fell swoop - I choose the latter.
We could either define 6 new reserved flags (reserved1..6) and handle
them individually (and rename them as new flags are defined), or we
could simply copy the entire unmodified byte over - I choose the latter.
This unfortunately requires some anonymous union/struct magic,
so we add a static assert on the struct size for a little extra safety.
Cc: David Ahern <dsahern@kernel.org>
Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are seeing cases where neigh_cleanup_and_release() is called by
neigh_forced_gc() many times in a row with preemption turned off.
When running on a low powered CPU at a low CPU frequency, this has
been measured to keep preemption off for ~10 ms. That's not great on a
system with HZ=1000 which expects tasks to be able to schedule in
with ~1ms latency.
Suggested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Judy Hsiao <judyhsiao@chromium.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After backporting commit 581512a6dc ("vsock/virtio: MSG_ZEROCOPY
flag support") in CentOS Stream 9, CI reported the following error:
In file included from ./include/linux/kernel.h:17,
from ./include/linux/list.h:9,
from ./include/linux/preempt.h:11,
from ./include/linux/spinlock.h:56,
from net/vmw_vsock/virtio_transport_common.c:9:
net/vmw_vsock/virtio_transport_common.c: In function ‘virtio_transport_can_zcopy‘:
./include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror]
20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
| ^~
./include/linux/minmax.h:26:18: note: in expansion of macro ‘__typecheck‘
26 | (__typecheck(x, y) && __no_side_effects(x, y))
| ^~~~~~~~~~~
./include/linux/minmax.h:36:31: note: in expansion of macro ‘__safe_cmp‘
36 | __builtin_choose_expr(__safe_cmp(x, y), \
| ^~~~~~~~~~
./include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp‘
45 | #define min(x, y) __careful_cmp(x, y, <)
| ^~~~~~~~~~~~~
net/vmw_vsock/virtio_transport_common.c:63:37: note: in expansion of macro ‘min‘
63 | int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);
We could solve it by using min_t(), but this operation seems entirely
unnecessary, because we also pass MAX_SKB_FRAGS to iov_iter_npages(),
which performs almost the same check, returning at most MAX_SKB_FRAGS
elements. So, let's eliminate this unnecessary comparison.
Fixes: 581512a6dc ("vsock/virtio: MSG_ZEROCOPY flag support")
Cc: avkrasnov@salutedevices.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Link: https://lore.kernel.org/r/20231206164143.281107-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The byte order conversions of ISM GID and DMB token are missing in
process of CLC accept and confirm. So fix it.
Fixes: 3d9725a6a1 ("net/smc: common routine for CLC accept and confirm")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/1701882157-87956-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "NET_DM" generic netlink family notifies drop locations over the
"events" multicast group. This is problematic since by default generic
netlink allows non-root users to listen to these notifications.
Fix by adding a new field to the generic netlink multicast group
structure that when set prevents non-root users or root without the
'CAP_SYS_ADMIN' capability (in the user namespace owning the network
namespace) from joining the group. Set this field for the "events"
group. Use 'CAP_SYS_ADMIN' rather than 'CAP_NET_ADMIN' because of the
nature of the information that is shared over this group.
Note that the capability check in this case will always be performed
against the initial user namespace since the family is not netns aware
and only operates in the initial network namespace.
A new field is added to the structure rather than using the "flags"
field because the existing field uses uAPI flags and it is inappropriate
to add a new uAPI flag for an internal kernel check. In net-next we can
rework the "flags" field to use internal flags and fold the new field
into it. But for now, in order to reduce the amount of changes, add a
new field.
Since the information can only be consumed by root, mark the control
plane operations that start and stop the tracing as root-only using the
'GENL_ADMIN_PERM' flag.
Tested using [1].
Before:
# capsh -- -c ./dm_repo
# capsh --drop=cap_sys_admin -- -c ./dm_repo
After:
# capsh -- -c ./dm_repo
# capsh --drop=cap_sys_admin -- -c ./dm_repo
Failed to join "events" multicast group
[1]
$ cat dm.c
#include <stdio.h>
#include <netlink/genl/ctrl.h>
#include <netlink/genl/genl.h>
#include <netlink/socket.h>
int main(int argc, char **argv)
{
struct nl_sock *sk;
int grp, err;
sk = nl_socket_alloc();
if (!sk) {
fprintf(stderr, "Failed to allocate socket\n");
return -1;
}
err = genl_connect(sk);
if (err) {
fprintf(stderr, "Failed to connect socket\n");
return err;
}
grp = genl_ctrl_resolve_grp(sk, "NET_DM", "events");
if (grp < 0) {
fprintf(stderr,
"Failed to resolve \"events\" multicast group\n");
return grp;
}
err = nl_socket_add_memberships(sk, grp, NFNLGRP_NONE);
if (err) {
fprintf(stderr, "Failed to join \"events\" multicast group\n");
return err;
}
return 0;
}
$ gcc -I/usr/include/libnl3 -lnl-3 -lnl-genl-3 -o dm_repo dm.c
Fixes: 9a8afc8d39 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol")
Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231206213102.1824398-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "psample" generic netlink family notifies sampled packets over the
"packets" multicast group. This is problematic since by default generic
netlink allows non-root users to listen to these notifications.
Fix by marking the group with the 'GENL_UNS_ADMIN_PERM' flag. This will
prevent non-root users or root without the 'CAP_NET_ADMIN' capability
(in the user namespace owning the network namespace) from joining the
group.
Tested using [1].
Before:
# capsh -- -c ./psample_repo
# capsh --drop=cap_net_admin -- -c ./psample_repo
After:
# capsh -- -c ./psample_repo
# capsh --drop=cap_net_admin -- -c ./psample_repo
Failed to join "packets" multicast group
[1]
$ cat psample.c
#include <stdio.h>
#include <netlink/genl/ctrl.h>
#include <netlink/genl/genl.h>
#include <netlink/socket.h>
int join_grp(struct nl_sock *sk, const char *grp_name)
{
int grp, err;
grp = genl_ctrl_resolve_grp(sk, "psample", grp_name);
if (grp < 0) {
fprintf(stderr, "Failed to resolve \"%s\" multicast group\n",
grp_name);
return grp;
}
err = nl_socket_add_memberships(sk, grp, NFNLGRP_NONE);
if (err) {
fprintf(stderr, "Failed to join \"%s\" multicast group\n",
grp_name);
return err;
}
return 0;
}
int main(int argc, char **argv)
{
struct nl_sock *sk;
int err;
sk = nl_socket_alloc();
if (!sk) {
fprintf(stderr, "Failed to allocate socket\n");
return -1;
}
err = genl_connect(sk);
if (err) {
fprintf(stderr, "Failed to connect socket\n");
return err;
}
err = join_grp(sk, "config");
if (err)
return err;
err = join_grp(sk, "packets");
if (err)
return err;
return 0;
}
$ gcc -I/usr/include/libnl3 -lnl-3 -lnl-genl-3 -o psample_repo psample.c
Fixes: 6ae0a62861 ("net: Introduce psample, a new genetlink channel for packet sampling")
Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231206213102.1824398-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The curr pointer must also be updated on the splice similar to how
we do this for other copy types.
Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reported-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20231206232706.374377-2-john.fastabend@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmVwtKsACgkQ1V2XiooU
IOQSmBAAo+L16FYFpdaZIUoFoE9D+WdZCiKVuY2pn9NNOzg8k9OEu2XYqjvOPjcu
68GvgWIbDp0O6lIj70Oz2VxAG7z74hAJJIWsq/ufsSnzggAl+L/eNm0naOysd8rv
wzzO8W22xB6aGSPVUxFul3L9FD0Cf0NkwCVHzTnndthcfxX6GEeV33puTnUnC2Ww
b3l01tw/mbX7But0j4XwvFhSdXUGjUQPy6h8T90kp/wgZ8Fkyqsm2VFPRn2Yrayr
XQ/RS+zdTy4Y07P4KA50k80l0lGbpcoZMEFbvzPmsKCWKgqqpb8fE2EgZ4sPSeY/
0nDhkWzKDriJrmZS2kW/GEXyV61DbIL/Pmq/Y+GQwBHoxNMP/tonJ2EFXsmQPVwO
jHsYerqQ6X8zsuCwtsUldp/afz+SjBL8HzQjfm5TVxNmArXythKsSpJmfUoDOsp1
opSxoXztFfFid7K3T1/g+yhoVE6Xxe+KdnWYdpfC3t2VrX/HZts2Wxeuvpe7UbOB
ktK9jceknMb22AHNixX4UeyY/xYcZMpuQtwsXp6twaxOBbB++g1X/rTuK6TeXoFb
2fPFjPaBuiyqFRC78leLYiHWtbKIQg7i5yIcuadUUifXW6q6C9gUUM9dsDt6SOlr
Uao9GayMsjhLDgmUMesSsIJ6LILkhTytPNpE8OiOmIdZv7fhmqo=
=dEOA
-----END PGP SIGNATURE-----
Merge tag 'nf-23-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Incorrect nf_defrag registration for bpf link infra, from D. Wythe.
2) Skip inactive elements in pipapo set backend walk to avoid double
deactivation, from Florian Westphal.
3) Fix NFT_*_F_PRESENT check with big endian arch, also from Florian.
4) Bail out if number of expressions in NFTA_DYNSET_EXPRESSIONS mismatch
stateful expressions in set declaration.
5) Honor family in table lookup by handle. Broken since 4.16.
6) Use sk_callback_lock to protect access to sk->sk_socket in xt_owner.
sock_orphan() might zap this pointer, from Phil Sutter.
All of these fixes address broken stuff for several releases.
* tag 'nf-23-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
netfilter: nf_tables: validate family when identifying table via handle
netfilter: nf_tables: bail out on mismatching dynset and set expressions
netfilter: nf_tables: fix 'exist' matching on bigendian arches
netfilter: nft_set_pipapo: skip inactive elements during set walk
netfilter: bpf: fix bad registration on nf_defrag
====================
Link: https://lore.kernel.org/r/20231206180357.959930-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files via sockets via SCM_RIGHT, so there
are no possible cycles invloving registered files and thus rendering
SCM accounting on the io_uring side unnecessary.
Cc: <stable@vger.kernel.org>
Fixes: 0091bfc817 ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.1701868795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZXDtiAAKCRDbK58LschI
g1QAAP9SxVT/DsI4SfxmhYgJHgxxmzw4mL96MqnEL9n2D6nOeAEApAjkY5UbhZwC
+bvNPnoP9Tl0h3et8BB08bbRRJLiHAE=
=2sEV
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-12-06
We've added 4 non-merge commits during the last 6 day(s) which contain
a total of 7 files changed, 185 insertions(+), 55 deletions(-).
The main changes are:
1) Fix race found by syzkaller on prog_array_map_poke_run when
a BPF program's kallsym symbols were still missing, from Jiri Olsa.
2) Fix BPF verifier's branch offset comparison for BPF_JMP32 | BPF_JA,
from Yonghong Song.
3) Fix xsk's poll handling to only set mask on bound xsk sockets,
from Yewon Choi.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add test for early update in prog_array_map_poke_run
bpf: Fix prog_array_map_poke_run map poke update
xsk: Skip polling event check for unbound socket
bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
====================
Link: https://lore.kernel.org/r/20231206220528.12093-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to linkwatch_forget_dev() (and perhaps others?) checking for
list_empty(&dev->link_watch_list), we must have all manipulations
of even the local on-stack list 'wrk' here under spinlock, since
even that list can be reached otherwise via dev->link_watch_list.
This is already the case, but makes this a bit counter-intuitive,
often local lists are used to _not_ have to use locking for their
local use.
Remove the local list as it doesn't seem to serve any purpose.
While at it, move a variable declaration into the loop using it.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20231205170011.56576dcc1727.I698b72219d9f6ce789bd209b8f6dffd0ca32a8f2@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is based on a detailed report and ideas from Yepeng Pan
and Christian Rossow.
ACK seq validation is currently following RFC 5961 5.2 guidelines:
The ACK value is considered acceptable only if
it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
SND.NXT). All incoming segments whose ACK value doesn't satisfy the
above condition MUST be discarded and an ACK sent back. It needs to
be noted that RFC 793 on page 72 (fifth check) says: "If the ACK is a
duplicate (SEG.ACK < SND.UNA), it can be ignored. If the ACK
acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an
ACK, drop the segment, and return". The "ignored" above implies that
the processing of the incoming data segment continues, which means
the ACK value is treated as acceptable. This mitigation makes the
ACK check more stringent since any ACK < SND.UNA wouldn't be
accepted, instead only ACKs that are in the range ((SND.UNA -
MAX.SND.WND) <= SEG.ACK <= SND.NXT) get through.
This can be refined for new (and possibly spoofed) flows,
by not accepting ACK for bytes that were never sent.
This greatly improves TCP security at a little cost.
I added a Fixes: tag to make sure this patch will reach stable trees,
even if the 'blamed' patch was adhering to the RFC.
tp->bytes_acked was added in linux-4.2
Following packetdrill test (courtesy of Yepeng Pan) shows
the issue at hand:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1024) = 0
// ---------------- Handshake ------------------- //
// when window scale is set to 14 the window size can be extended to
// 65535 * (2^14) = 1073725440. Linux would accept an ACK packet
// with ack number in (Server_ISN+1-1073725440. Server_ISN+1)
// ,though this ack number acknowledges some data never
// sent by the server.
+0 < S 0:0(0) win 65535 <mss 1400,nop,wscale 14>
+0 > S. 0:0(0) ack 1 <...>
+0 < . 1:1(0) ack 1 win 65535
+0 accept(3, ..., ...) = 4
// For the established connection, we send an ACK packet,
// the ack packet uses ack number 1 - 1073725300 + 2^32,
// where 2^32 is used to wrap around.
// Note: we used 1073725300 instead of 1073725440 to avoid possible
// edge cases.
// 1 - 1073725300 + 2^32 = 3221241997
// Oops, old kernels happily accept this packet.
+0 < . 1:1001(1000) ack 3221241997 win 65535
// After the kernel fix the following will be replaced by a challenge ACK,
// and prior malicious frame would be dropped.
+0 > . 1:1(0) ack 1001
Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yepeng Pan <yepeng.pan@cispa.de>
Reported-by: Christian Rossow <rossow@cispa.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20231205161841.2702925-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec()
to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for
different archs during the transition. Add the following helpers for this
transition:
void *arch_alloc_bpf_trampoline(unsigned int size);
void arch_free_bpf_trampoline(void *image, unsigned int size);
void arch_protect_bpf_trampoline(void *image, unsigned int size);
void arch_unprotect_bpf_trampoline(void *image, unsigned int size);
The fallback version of these helpers require size <= PAGE_SIZE, but they
are only called with size == PAGE_SIZE. They will be called with size <
PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A concurrently running sock_orphan() may NULL the sk_socket pointer in
between check and deref. Follow other users (like nft_meta.c for
instance) and acquire sk_callback_lock before dereferencing sk_socket.
Fixes: 0265ab44ba ("[NETFILTER]: merge ipt_owner/ip6t_owner in xt_owner")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Validate table family when looking up for it via NFTA_TABLE_HANDLE.
Fixes: 3ecbfd65f5 ("netfilter: nf_tables: allocate handle and delete objects via handle")
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If dynset expressions provided by userspace is larger than the declared
set expressions, then bail out.
Fixes: 48b0ae046e ("netfilter: nftables: netlink support for several set element expressions")
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Maze reports "tcp option fastopen exists" fails to match on
OpenWrt 22.03.5, r20134-5f15225c1e (5.10.176) router.
"tcp option fastopen exists" translates to:
inet
[ exthdr load tcpopt 1b @ 34 + 0 present => reg 1 ]
[ cmp eq reg 1 0x00000001 ]
.. but existing nft userspace generates a 1-byte compare.
On LSB (x86), "*reg32 = 1" is identical to nft_reg_store8(reg32, 1), but
not on MSB, which will place the 1 last. IOW, on bigendian aches the cmp8
is awalys false.
Make sure we store this in a consistent fashion, so existing userspace
will also work on MSB (bigendian).
Regardless of this patch we can also change nft userspace to generate
'reg32 == 0' and 'reg32 != 0' instead of u8 == 0 // u8 == 1 when
adding 'option x missing/exists' expressions as well.
Fixes: 3c1fece881 ("netfilter: nft_exthdr: Allow checking TCP option presence, too")
Fixes: b9f9a485fb ("netfilter: nft_exthdr: add boolean DCCP option matching")
Fixes: 055c4b34b9 ("netfilter: nft_fib: Support existence check")
Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/CAHo-OozyEqHUjL2-ntATzeZOiuftLWZ_HU6TOM_js4qLfDEAJg@mail.gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise set elements can be deactivated twice which will cause a crash.
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This extra check doesn't work for a handshake when SYN segment has
(current_key.maclen != rnext_key.maclen). It could be amended to
preserve rnext_key.maclen instead of current_key.maclen, but that
requires a lookup on listen socket.
Originally, this extra maclen check was introduced just because it was
cheap. Drop it and convert tcp_request_sock::maclen into boolean
tcp_request_sock::used_tcp_ao.
Fixes: 06b22ef295 ("net/tcp: Wire TCP-AO to request sockets")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the connection was established, don't allow adding TCP-AO keys that
don't match the peer. Currently, there are checks for ip-address
matching, but L3 index check is missing. Add it to restrict userspace
shooting itself somewhere.
Yet, nothing restricts the CAP_NET_RAW user from trying to shoot
themselves by performing setsockopt(SO_BINDTODEVICE) or
setsockopt(SO_BINDTOIFINDEX) over an established TCP-AO connection.
So, this is just "minimum effort" to potentially save someone's
debugging time, rather than a full restriction on doing weird things.
Fixes: 248411b8cb ("net/tcp: Wire up l3index to TCP-AO")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Listen socket is not an established TCP connection, so
setsockopt(TCP_AO_REPAIR) doesn't have any impact.
Restrict this uAPI for listen sockets.
Fixes: faadfaba5e ("net/tcp: Add TCP_AO_REPAIR")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently functions that pre-calculate TCP header options length use
unaligned TCP-AO header + MAC-length for skb reservation.
And the functions that actually write TCP-AO options into skb do align
the header. Nothing good can come out of this for ((maclen % 4) != 0).
Provide tcp_ao_len_aligned() helper and use it everywhere for TCP
header options space calculations.
Fixes: 1e03d32bea ("net/tcp: Add TCP-AO sign to outgoing packets")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This function has so many arguments already, before adding
yet another one, refactor it to take a struct instead.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In ipgre_xmit(), skb_pull() may fail even if pskb_inet_may_pull() returns
true. For example, applications can use PF_PACKET to create a malformed
packet with no IP header. This type of packet causes a problem such as
uninit-value access.
This patch ensures that skb_pull() can pull the required size by checking
the skb with pskb_network_may_pull() before skb_pull().
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Link: https://lore.kernel.org/r/20231202161441.221135-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit da37845fdc ("packet: uses kfree_skb() for errors.") switches
from consume_skb to kfree_skb to improve error handling. However, this
could bring a lot of noises when we monitor real packet drops in
kfree_skb[1], because in tpacket_rcv or packet_rcv only packet clones
can be freed, not actual packets.
Adding a generic drop reason to allow distinguish these "clone drops".
[1]: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/
Fixes: da37845fdc ("packet: uses kfree_skb() for errors.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/ZW4piNbx3IenYnuw@debian.debian
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are multiple ways to query for the carrier state: through
rtnetlink, sysfs, and (possibly) ethtool. Synchronize linkwatch
work before these operations so that we don't have a situation
where userspace queries the carrier state between the driver's
carrier off->on transition and linkwatch running and expects it
to work, when really (at least) TX cannot work until linkwatch
has run.
I previously posted a longer explanation of how this applies to
wireless [1] but with this wireless can simply query the state
before sending data, to ensure the kernel is ready for it.
[1] https://lore.kernel.org/all/346b21d87c69f817ea3c37caceb34f1f56255884.camel@sipsolutions.net/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231204214706.303c62768415.I1caedccae72ee5a45c9085c5eb49c145ce1c0dd5@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The variables are organized according in the following way:
- TX read-mostly hotpath cache lines
- TXRX read-mostly hotpath cache lines
- RX read-mostly hotpath cache lines
- TX read-write hotpath cache line
- TXRX read-write hotpath cache line
- RX read-write hotpath cache line
Fastpath cachelines end after rcvq_space.
Cache line boundaries are enforced only between read-mostly and
read-write. That is, if read-mostly tx cachelines bleed into
read-mostly txrx cachelines, we do not care. We care about the
boundaries between read and write cachelines because we want
to prevent false sharing.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 8
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231204201232.520025-3-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reorganize fast path variables on tx-txrx-rx order
Fastpath variables end after npinfo.
Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 4
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the blamed commit below, if the user-space application performs
window clamping when tp->rcv_wnd is 0, the TCP socket will never be
able to announce a non 0 receive window, even after completely emptying
the receive buffer and re-setting the window clamp to higher values.
Refactor tcp_set_window_clamp() to address the issue: when the user
decreases the current clamp value, set rcv_ssthresh according to the
same logic used at buffer initialization, but ensuring reserved mem
provisioning.
To avoid code duplication factor-out the relevant bits from
tcp_adjust_rcv_ssthresh() in a new helper and reuse it in the above
scenario.
When increasing the clamp value, give the rcv_ssthresh a chance to grow
according to previously implemented heuristic.
Fixes: 3aa7857fe1 ("tcp: enable mid stream window clamp")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/705dad54e6e6e9a010e571bf58e0b35a8ae70503.1701706073.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In xsk_poll(), checking available events and setting mask bits should
be executed only when a socket has been bound. Setting mask bits for
unbound socket is meaningless.
Currently, it checks events even when xsk_check_common() failed.
To prevent this, we move goto location (skip_tx) after that checking.
Fixes: 1596dae2f1 ("xsk: check IFF_UP earlier in Tx path")
Signed-off-by: Yewon Choi <woni9911@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20231201061048.GA1510@libra05
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
The actions array is contiguous, so stop processing whenever a NULL
is found. This is already the assumption for tcf_action_destroy[1],
which is called from tcf_actions_init.
[1] https://elixir.bootlin.com/linux/v6.7-rc3/source/net/sched/act_api.c#L1115
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ops array is contiguous, so stop processing whenever a NULL is found
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In tcf_action_add, when putting the reference for the bound actions
it assigns NULLs to just created actions passing a non contiguous
array to tcf_action_put_many.
Refactor the code so the actions array is always contiguous.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use the auxiliary macro tcf_act_for_each_action in all the
functions that expect a contiguous action array
Suggested-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Implement the netdev netlink framework functions for
napi support. The netdev structure tracks all the napi
instances and napi fields. The napi instances and associated
parameters can be retrieved this way.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147333637.5260.14807433239805550815.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the napi pointer in netdev queue for tracking the napi
instance for each queue. This achieves the queue<->napi mapping.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support in netlink spec(netdev.yaml) for queue information.
Add code generated from the spec.
Note: The "queue-type" attribute takes values 0 and 1 for rx
and tx queue type respectively.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147330963.5260.2576294626647300472.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().
The code is inspired by the ->lhash2 loop. However there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.
There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.
With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follow:
$ ss -t state bound-inactive
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 0.0.0.0:40000 0.0.0.0:*
0 0 [::]:60000 [::]:*
0 0 *:64000 *:*
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/b3a84ae61e19c06806eea9c602b3b66e8f0cfc81.1701362867.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In some potential instances the reference count on struct packet_sock
could be saturated and cause overflows which gets the kernel a bit
confused. To prevent this, move to a 64-bit atomic reference count on
64-bit architectures to prevent the possibility of this type to overflow.
Because we can not handle saturation, using refcount_t is not possible
in this place. Maybe someday in the future if it changes it could be
used. Also, instead of using plain atomic64_t, use atomic_long_t instead.
32-bit machines tend to be memory-limited (i.e. anything that increases
a reference uses so much memory that you can't actually get to 2**32
references). 32-bit architectures also tend to have serious problems
with 64-bit atomics. Hence, atomic_long_t is the more natural solution.
Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk>
Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231201131021.19999-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reorganize fast path variables on tx-txrx-rx order.
Fastpath cacheline ends after sysctl_tcp_rmem.
There are only read-only variables here. (write is on the control path
and not considered in this case)
Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 4
Fast path variables span cache lines after change: 2
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use DO_ONCE_LITE_IF() and __cold attribute to put tcp_gro_dev_warn()
out of line.
This also allows the message to be printed again after a
"echo 1 > /sys/kernel/debug/clear_warn_once"
Also add a READ_ONCE() when reading device mtu, as it could
be changed concurrently.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231130184135.4130860-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BYTES_PER_KBIT is defined in units.h, use that definition.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231128174813.394462-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
=Xowm
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-11-30
We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).
The main changes are:
1) Add initial TX metadata implementation for AF_XDP with support in mlx5
and stmmac drivers. Two types of offloads are supported right now, that
is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
stmmac implementation from Song Yoong Siang.
2) Change BPF verifier logic to validate global subprograms lazily instead
of unconditionally before the main program, so they can be guarded using
BPF CO-RE techniques, from Andrii Nakryiko.
3) Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter, from Jiri Olsa.
4) Use pkg-config in BPF selftests to determine ld flags which is
in particular needed for linking statically, from Akihiko Odaki.
5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
bpf/tests: Remove duplicate JSGT tests
selftests/bpf: Add TX side to xdp_hw_metadata
selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add csum helpers
selftests/xsk: Support tx_metadata_len
xsk: Add option to calculate TX checksum in SW
xsk: Validate xsk_tx_metadata flags
xsk: Document tx_metadata_len layout
net: stmmac: Add Tx HWTS support to XDP ZC
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
tools: ynl: Print xsk-features from the sample
xsk: Add TX timestamp and TX checksum offload support
xsk: Support tx_metadata_len
selftests/bpf: Use pkg-config for libelf
selftests/bpf: Override PKG_CONFIG for static builds
selftests/bpf: Choose pkg-config for the target
bpftool: Add support to display uprobe_multi links
selftests/bpf: Add link_info test for uprobe_multi link
selftests/bpf: Use bpf_link__destroy in fill_link_info tests
...
====================
Conflicts:
Documentation/netlink/specs/netdev.yaml:
839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/
While at it also regen, tree is dirty after:
48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.
Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().
Explicitly handle the truncation case by returning the size of the
resulting string.
If "nodename" was ever longer than sizeof(clnt->cl_nodename) - 1, this
change will fix a bug where clnt->cl_nodelen would end up thinking there
were more characters in clnt->cl_nodename than there actually were,
which might have lead to kernel memory content exposures.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20231114175407.work.410-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
syzbot was able to trigger a crash [1] in page_pool_unlist()
page_pool_list() only inserts a page pool into a netdev page pool list
if a netdev was set in params.
Even if the kzalloc() call in page_pool_create happens to initialize
pool->user.list, I chose to be more explicit in page_pool_list()
adding one INIT_HLIST_NODE().
We could test in page_pool_unlist() if netdev was set,
but since netdev can be changed to lo, it seems more robust to
check if pool->user.list is hashed before calling hlist_del().
[1]
Illegal XDP return value 4294946546 on prog (id 2) dev N/A, expect packet loss!
general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 5064 Comm: syz-executor391 Not tainted 6.7.0-rc2-syzkaller-00533-ga379972973a8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:__hlist_del include/linux/list.h:988 [inline]
RIP: 0010:hlist_del include/linux/list.h:1002 [inline]
RIP: 0010:page_pool_unlist+0xd1/0x170 net/core/page_pool_user.c:342
Code: df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 90 00 00 00 4c 8b a3 f0 06 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 75 68 48 85 ed 49 89 2c 24 74 24 e8 1b ca 07 f9 48 8d
RSP: 0018:ffffc900039ff768 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff88814ae02000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88814ae026f0
RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1d57fdc
R10: ffffffff8eabfee3 R11: ffffffff8aa0008b R12: 0000000000000000
R13: ffff88814ae02000 R14: dffffc0000000000 R15: 0000000000000001
FS: 000055555717a380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002555398 CR3: 0000000025044000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__page_pool_destroy net/core/page_pool.c:851 [inline]
page_pool_release+0x507/0x6b0 net/core/page_pool.c:891
page_pool_destroy+0x1ac/0x4c0 net/core/page_pool.c:956
xdp_test_run_teardown net/bpf/test_run.c:216 [inline]
bpf_test_run_xdp_live+0x1578/0x1af0 net/bpf/test_run.c:388
bpf_prog_test_run_xdp+0x827/0x1530 net/bpf/test_run.c:1254
bpf_prog_test_run kernel/bpf/syscall.c:4041 [inline]
__sys_bpf+0x11bf/0x4920 kernel/bpf/syscall.c:5402
__do_sys_bpf kernel/bpf/syscall.c:5488 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5486 [inline]
__x64_sys_bpf+0x78/0xc0 kernel/bpf/syscall.c:5486
Fixes: 083772c9f9 ("net: page_pool: record pools per netdev")
Reported-and-tested-by: syzbot+f9f8efb58a4db2ca98d0@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231130092259.3797753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During reload-reinit, all entities except for params, resources, regions
and health reporter should be removed and re-added. Add a warning to
be triggered in case the driver behaves differently.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>