Similar to changes done in TCP in blamed commit.
We should not sense pfmemalloc status in sendpage() methods.
Fixes: 3261400639 ("tcp: TX zerocopy should not sense pfmemalloc status")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221027040637.1107703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmNabUQTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCtfkuQ2KDTXaTBB/9xLKgfBKQtTKpLJ8HoDLESWbHr6v9T
8GmOY0+UPbKxOqfMknYfwV57ca9agIw5IOFx9ry7w6PrytDF2S5ojYHtmQvvbmQJ
2HyuP8k4Qt3MIS2O1fxHg6fe4qMhdDBJ2vkp5AGsfFWG3i189gPAGQFOtI19Zt+j
Se547s5WzGlgcCQulbveyJsTO85Z6xhBr7VOy8mxSkkxivrYzKRk7YjWlulHCCFF
A/Q15dXFyBtH+oG+Gl3Nnj0ttgiE7X6oOzwTH5JkUmUDnTujPZQX+tcJYNHKB+xw
m3fwqRM8mWb1+m1gCgDKVT0Yaurbhbmv1phdZcPgFjCVU6PR1miFFG8M
=YUyU
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2022-10-27
Anssi Hannula fixes the use of the completions in the kvaser_usb
driver.
Biju Das contributes 2 patches for the rcar_canfd driver. A IRQ storm
that can be triggered by high CAN bus load and channel specific IRQ
handlers are fixed.
Yang Yingliang fixes the j1939 transport protocol by moving a
kfree_skb() out of a spin_lock_irqsave protected section.
* tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: j1939: transport: j1939_session_skb_drop_old(): spin_unlock_irqrestore() before kfree_skb()
can: rcar_canfd: fix channel specific IRQ handling for RZ/G2L
can: rcar_canfd: rcar_canfd_handle_global_receive(): fix IRQ storm on global FIFO receive
can: kvaser_usb: Fix possible completions during init_completion
====================
Link: https://lore.kernel.org/r/20221027114356.1939821-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but
fib_info_update_nhc_saddr() needs the scope of the route, which is
the scope "before" fib_nh_scope, ie fib_nh_scope - 1.
This patch fixes the problem described in commit 747c143072 ("ip: fix
dflt addr selection for connected nexthop").
Fixes: 597cfe4fc3 ("nexthop: Add support for IPv4 nexthops")
Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 747c143072.
As explained by Julian, nhc_scope is related to nhc_gw, not to the route.
Revert the original patch. The initial problem is fixed differently in the
next commit.
Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit eb55dc09b5.
The patch that introduces this bug is reverted right after this one.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function pfkey_send_acquire may race with pfkey_register
(which could even be in a different name space). This may result
in a buffer overrun.
Allocating the maximum amount of memory that could be used prevents
this.
Reported-by: syzbot+1e9af9185d8850e2c2fa@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
It is not allowed to call kfree_skb() from hardware interrupt context
or with interrupts being disabled. The skb is unlinked from the queue,
so it can be freed after spin_unlock_irqrestore().
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20221027091237.2290111-1-yangyingliang@huawei.com
Cc: stable@vger.kernel.org
[mkl: adjust subject]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
As noted by Paolo Abeni, pr_warn doesn't generate any splat and can still
preserve the warning to the user that feature downgrade occurred. We
likely cannot introduce other kinds of checks / enforcement here because
syzbot can generate different genl versions to the datapath.
Reported-by: syzbot+31cde0bef4bbf8ba2d86@syzkaller.appspotmail.com
Fixes: 44da5ae5fb ("openvswitch: Drop user features if old user space attempted to create datapath")
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Another syzbot report [1] with no reproducer hints
at a bug in ip6_gre tunnel (dev:ip6gretap0)
Since ipv6 mcast code makes sure to read dev->mtu once
and applies a sanity check on it (see commit b9b312a7a4
"ipv6: mcast: better catch silly mtu values"), a remaining
possibility is that a layer is able to set dev->mtu to
an underflowed value (high order bit set).
This could happen indeed in ip6gre_tnl_link_config_route(),
ip6_tnl_link_config() and ipip6_tunnel_bind_dev()
Make sure to sanitize mtu value in a local variable before
it is written once on dev->mtu, as lockless readers could
catch wrong temporary value.
[1]
skbuff: skb_over_panic: text:ffff80000b7a2f38 len:40 put:40 head:ffff000149dcf200 data:ffff000149dcf2b0 tail:0xd8 end:0xc0 dev:ip6gretap0
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:120
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 10241 Comm: kworker/1:1 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022
Workqueue: mld mld_ifc_work
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : skb_panic+0x4c/0x50 net/core/skbuff.c:116
lr : skb_panic+0x4c/0x50 net/core/skbuff.c:116
sp : ffff800020dd3b60
x29: ffff800020dd3b70 x28: 0000000000000000 x27: ffff00010df2a800
x26: 00000000000000c0 x25: 00000000000000b0 x24: ffff000149dcf200
x23: 00000000000000c0 x22: 00000000000000d8 x21: ffff80000b7a2f38
x20: ffff00014c2f7800 x19: 0000000000000028 x18: 00000000000001a9
x17: 0000000000000000 x16: ffff80000db49158 x15: ffff000113bf1a80
x14: 0000000000000000 x13: 00000000ffffffff x12: ffff000113bf1a80
x11: ff808000081c0d5c x10: 0000000000000000 x9 : 73f125dc5c63ba00
x8 : 73f125dc5c63ba00 x7 : ffff800008161d1c x6 : 0000000000000000
x5 : 0000000000000080 x4 : 0000000000000001 x3 : 0000000000000000
x2 : ffff0001fefddcd0 x1 : 0000000100000000 x0 : 0000000000000089
Call trace:
skb_panic+0x4c/0x50 net/core/skbuff.c:116
skb_over_panic net/core/skbuff.c:125 [inline]
skb_put+0xd4/0xdc net/core/skbuff.c:2049
ip6_mc_hdr net/ipv6/mcast.c:1714 [inline]
mld_newpack+0x14c/0x270 net/ipv6/mcast.c:1765
add_grhead net/ipv6/mcast.c:1851 [inline]
add_grec+0xa20/0xae0 net/ipv6/mcast.c:1989
mld_send_cr+0x438/0x5a8 net/ipv6/mcast.c:2115
mld_ifc_work+0x38/0x290 net/ipv6/mcast.c:2653
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
Code: 91011400 aa0803e1 a90027ea 94373093 (d4210000)
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221024020124.3756833-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2022-10-24
Two fixup patches for return code changes of an earlier commit.
Wei Yongjun fixed a missed -EINVAL return on the recent change, while
Alexander Aring adds handling for unknown address type cases as well.
Miquel Raynal fixed a long standing issue with LQI value recording
which got broken 8 years ago. (It got more attention with the work
in progress enhancement in wpan).
* tag 'ieee802154-for-net-2022-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
mac802154: Fix LQI recording
net: ieee802154: fix error return code in dgram_bind()
net: ieee802154: return -EINVAL for unknown addr type
====================
Link: https://lore.kernel.org/r/20221024102301.9433-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Our CI reported lockdep splat in the fastopen code:
======================================================
WARNING: possible circular locking dependency detected
6.0.0.mptcp_f5e8bfe9878d+ #1558 Not tainted
------------------------------------------------------
packetdrill/1071 is trying to acquire lock:
ffff8881bd198140 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_wait_for_connect+0x19c/0x310
but task is already holding lock:
ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (k-sk_lock-AF_INET){+.+.}-{0:0}:
__lock_acquire+0xb6d/0x1860
lock_acquire+0x1d8/0x620
lock_sock_nested+0x37/0xd0
inet_stream_connect+0x3f/0xa0
mptcp_connect+0x411/0x800
__inet_stream_connect+0x3ab/0x800
mptcp_stream_connect+0xac/0x110
__sys_connect+0x101/0x130
__x64_sys_connect+0x6e/0xb0
do_syscall_64+0x59/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
check_prev_add+0x15e/0x2110
validate_chain+0xace/0xdf0
__lock_acquire+0xb6d/0x1860
lock_acquire+0x1d8/0x620
lock_sock_nested+0x37/0xd0
inet_wait_for_connect+0x19c/0x310
__inet_stream_connect+0x26c/0x800
tcp_sendmsg_fastopen+0x341/0x650
mptcp_sendmsg+0x109d/0x1740
sock_sendmsg+0xe1/0x120
__sys_sendto+0x1c7/0x2a0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x59/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(k-sk_lock-AF_INET);
lock(sk_lock-AF_INET);
lock(k-sk_lock-AF_INET);
lock(sk_lock-AF_INET);
*** DEADLOCK ***
1 lock held by packetdrill/1071:
#0: ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
======================================================
The problem is caused by the blocking inet_wait_for_connect() releasing
and re-acquiring the msk socket lock while the subflow socket lock is
still held and the MPTCP socket requires that the msk socket lock must
be acquired before the subflow socket lock.
Address the issue always invoking tcp_sendmsg_fastopen() in an
unblocking manner, and later eventually complete the blocking
__inet_stream_connect() as needed.
Fixes: d98a82a6af ("mptcp: handle defer connect in mptcp_sendmsg")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current MPTCP connect implementation duplicates a bit of inet
code and does not use nor provide a struct proto->connect callback,
which in turn will not fit the upcoming fastopen implementation.
Refactor such implementation to use the common helper, moving the
MPTCP-specific bits into mptcp_connect(). Additionally, avoid an
indirect call to the subflow connect callback.
Note that the fastopen call-path invokes mptcp_connect() while already
holding the subflow socket lock. Explicitly keep track of such path
via a new MPTCP-level flag and handle the locking accordingly.
Additionally, track the connect flags in a new msk field to allow
propagating them to the subflow inet_stream_connect call.
Fixes: d98a82a6af ("mptcp: handle defer connect in mptcp_sendmsg")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mptcp_pm_nl_get_local_id() code assumes that the msk local address
is available at that point. For passive sockets, we initialize such
address at accept() time.
Depending on the running configuration and the user-space timing, a
passive MPJ subflow can join the msk socket before accept() completes.
In such case, the PM assigns a wrong local id to the MPJ subflow
and later PM netlink operations will end-up touching the wrong/unexpected
subflow.
All the above causes sporadic self-tests failures, especially when
the host is heavy loaded.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/308
Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Fixes: d045b9eb95 ("mptcp: introduce implicit endpoints")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To keep backward compatibility we used to leave attribute parsing
to the family if no policy is specified. This becomes tedious as
we move to more strict validation. Families must define reject all
policies if they don't want any attributes accepted.
Piggy back on the resv_start_op field as the switchover point.
AFAICT only ethtool has added new commands since the resv_start_op
was defined, and it has per-op policies so this should be a no-op.
Nonetheless the patch should still go into v6.1 for consistency.
Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/
Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNW024ACgkQMUZtbf5S
IrvX7w//SP/zKZwgzC13zd2rrCP16TX2QvkHPmSLvcldQDXdCypmsoc5Vb8UNpkG
jwAuy2pxqPy2oxTwTBQv9TNRT2oqEFOsFTK+w410whlL7g1wZ02aXU8qFhV2XumW
o4gRtM+UISPUKFbOnawdK1XlrNdeLF3bjETvW2GP9zxCb0iqoQXtDDNKxv2B2iQA
MSyTtzHA4n9GS7LKGtPgsP2Ose7h1Z+AjTIpQH1nvfEHJUf/wmxUdCK+fuwfeLjY
PhmYaPG/333j1bfBk1Ms/nUYA5KRXlEj9A/7jDtxhxNEwaTNKyLB19a6oVxXxpSQ
x/k+nZP1RColn5xeco5a1X9aHHQ46PJQ8wVAmxYDIeIA5XPMgShNmhAyjrq1ac+o
9vYeYpmnMGSTLdBMvGbWpynWHe7SddgF8LkbnYf2HLKbxe4bgkOnmxOUH4q9iinZ
MfVSknjax4DP0C7X1kGgR6WyltWnkrahOdUkINsIUNxj0KxJa/eStpJIbJrfkxNV
gHbOjB2/bF3SXENrS4A0IJCgsbO9YugN83Eyu0WDWQOw9wVgopzxOJx9R+H0wkVH
XpGGP8qi1DZiTE3iQiq1LHj6f6kirFmtt9QFH5yzaqtKBaqXakHaXwUO4VtD+BI9
NPFKvFL6jrp8EAn0PTM/RrvhJZN+V0bFXiyiMe0TLx+aR0UMxGc=
=dD6N
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.
The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
I'm biased.
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors"
* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
net-memcg: avoid stalls when under memory pressure
tcp: fix indefinite deferral of RTO with SACK reneging
tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
docs: netdev: offer performance feedback to contributors
kcm: annotate data-races around kcm->rx_wait
kcm: annotate data-races around kcm->rx_psock
net: fman: Use physical address for userspace interfaces
net/mlx5e: Cleanup MACsec uninitialization routine
atlantic: fix deadlock at aq_nic_stop
nfp: only clean `sp_indiff` when application firmware is unloaded
amd-xgbe: add the bit rate quirk for Molex cables
amd-xgbe: fix the SFP compliance codes check for DAC cables
amd-xgbe: enable PLL_CTL for fixed PHY modes only
amd-xgbe: use enums for mailbox cmd and sub_cmds
amd-xgbe: Yellow carp devices do not need rrc
bpf: Use __llist_del_all() whenever possbile during memory draining
bpf: Wait for busy refill_work when destroying bpf memory allocator
MAINTAINERS: add keyword match on PTP
...
This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.
The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.
This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.
Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.
Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
in tcp_add_backlog(), the variable limit is caculated by adding
sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value
of int and overflow. This patch reduces the limit budget by
halving the sndbuf to solve this issue since ACK packets are much
smaller than the payload.
Fixes: c9c3321257 ("tcp: add tcp_add_backlog()")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ops_init() interface is invoked to initialize the net, but
ops->init() fails, data is released. However, the ptr pointer in
net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked
to release the net, invalid address access occurs.
The process is as follows:
setup_net()
ops_init()
data = kzalloc(...) ---> alloc "data"
net_assign_generic() ---> assign "date" to ptr in net->gen
...
ops->init() ---> failed
...
kfree(data); ---> ptr in net->gen is invalid
...
ops_exit_list()
...
nfqnl_nf_hook_drop()
*q = nfnl_queue_pernet(net) ---> q is invalid
The following is the Call Trace information:
BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
Read of size 8 at addr ffff88810396b240 by task ip/15855
Call Trace:
<TASK>
dump_stack_lvl+0x8e/0xd1
print_report+0x155/0x454
kasan_report+0xba/0x1f0
nfqnl_nf_hook_drop+0x264/0x280
nf_queue_nf_hook_drop+0x8b/0x1b0
__nf_unregister_net_hook+0x1ae/0x5a0
nf_unregister_net_hooks+0xde/0x130
ops_exit_list+0xb0/0x170
setup_net+0x7ac/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Allocated by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0xa1/0xb0
__kmalloc+0x49/0xb0
ops_init+0xe7/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x155/0x1b0
slab_free_freelist_hook+0x11b/0x220
__kmem_cache_free+0xa4/0x360
ops_init+0xb9/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes: f875bae065 ("net: Automatically allocate per namespace data.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kcm->rx_psock can be read locklessly in kcm_rfree().
Annotate the read and writes accordingly.
We do the same for kcm->rx_wait in the following patch.
syzbot reported:
BUG: KCSAN: data-race in kcm_rfree / unreserve_rx_kcm
write to 0xffff888123d827b8 of 8 bytes by task 2758 on cpu 1:
unreserve_rx_kcm+0x72/0x1f0 net/kcm/kcmsock.c:313
kcm_rcv_strparser+0x2b5/0x3a0 net/kcm/kcmsock.c:373
__strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
strp_recv+0x6d/0x80 net/strparser/strparser.c:335
tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
strp_read_sock net/strparser/strparser.c:358 [inline]
do_strp_work net/strparser/strparser.c:406 [inline]
strp_work+0xe8/0x180 net/strparser/strparser.c:415
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
read to 0xffff888123d827b8 of 8 bytes by task 5859 on cpu 0:
kcm_rfree+0x14c/0x220 net/kcm/kcmsock.c:181
skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
skb_release_all net/core/skbuff.c:852 [inline]
__kfree_skb net/core/skbuff.c:868 [inline]
kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
kfree_skb include/linux/skbuff.h:1216 [inline]
kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
____sys_recvmsg+0x16c/0x2e0
___sys_recvmsg net/socket.c:2743 [inline]
do_recvmmsg+0x2f1/0x710 net/socket.c:2837
__sys_recvmmsg net/socket.c:2916 [inline]
__do_sys_recvmmsg net/socket.c:2939 [inline]
__se_sys_recvmmsg net/socket.c:2932 [inline]
__x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffff88812971ce00 -> 0x0000000000000000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 5859 Comm: syz-executor.3 Not tainted 6.0.0-syzkaller-12189-g19d17ab7c68b-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in 2014, the LQI was saved in the skb control buffer (skb->cb, or
mac_cb(skb)) without any actual reset of this area prior to its use.
As part of a useful rework of the use of this region, 32edc40ae6
("ieee802154: change _cb handling slightly") introduced mac_cb_init() to
basically memset the cb field to 0. In particular, this new function got
called at the beginning of mac802154_parse_frame_start(), right before
the location where the buffer got actually filled.
What went through unnoticed however, is the fact that the very first
helper called by device drivers in the receive path already used this
area to save the LQI value for later extraction. Resetting the cb field
"so late" led to systematically zeroing the LQI.
If we consider the reset of the cb field needed, we can make it as soon
as we get an skb from a device driver, right before storing the LQI,
as is the very first time we need to write something there.
Cc: stable@vger.kernel.org
Fixes: 32edc40ae6 ("ieee802154: change _cb handling slightly")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20221020142535.1038885-1-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNUFz4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqrSEAC+jhEaIB4srOr5DMta/CxBoKwiZIcMmsaK
pzRMFSTKWWtsx3COGjT0vwzm4VuZsZztE+A6buYs5riEDsI0l5TJiZ0fOqi9N0nB
orehq+7T2Cn5E848bMzLo7tSmlibAionrOQde5PbtmDcltuKddu9TiXNzD6XufLB
dLWbTRLdfxuFW0c8DYwng+KNBocXad64gu3ADuxKVGkWDs9tfOvaFWE/NgoXsEoq
a7uaBCF+DBWYhHWk0WPOA2+BLyNMN+g7owX1GWqW/Sr48CQDJSw5YnpKCi8+jZdb
uHreUIH96w/2A7CFfNCOfx5MhYCrX/j9ik6mDt2B8Gbh6vg3LlMADSo6xXSPag0r
7Lu7AVr7Sko6NKU2x/pCzv/U85TFuvuqSfH+YFK3rdEouZk26o5PfJEugAtD3gKv
smAk+ATmgx/iye5a2Uq7ClVVdOcdQULrC15/8XdcG7eI+l2q3AbgTa53PdBw3oF7
S+ANKMP5kPkPe1wDxFR0g3v7vsZmmfahRuss3xWC+PnHZPFZPQFRIohjWSsu1Exl
Ztri7Xy/ypC7bZ5F1pch1AjiLfLCGzpmKjT4QAy/mSFAJVboRDb0PTwN1w7uVCBQ
qK8TIw2iVKjEeIps/CedO+nQQrxhOKcizxLIPTyfaT6ZOJrbHalcQheEJqpWMnrF
w7dYkDnjYw==
=tTNe
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux
Pull io_uring follow-up from Jens Axboe:
"Currently the zero-copy has automatic fallback to normal transmit, and
it was decided that it'd be cleaner to return an error instead if the
socket type doesn't support it.
Zero-copy does work with UDP and TCP, it's more of a future proofing
kind of thing (eg for samba)"
* tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux:
io_uring/net: fail zc sendmsg when unsupported by socket
io_uring/net: fail zc send when unsupported by socket
net: flag sockets supporting msghdr originated zerocopy
We need an efficient way in io_uring to check whether a socket supports
zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
socket flags fields.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ethnl_default_dump_one() passes NULL as info.
It's correct not to set extack during dump, as we should just
silently skip interfaces which can't provide the information.
Reported-by: syzbot+81c4b4bbba6eea2cfcae@syzkaller.appspotmail.com
Fixes: 18ff0bcda6 ("ethtool: add interface to interact with Ethernet Power Equipment")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When invoking function cfg80211_calculate_bitrate_eht about
(320 MHz, EHT-MCS 13, EHT-NSS 2, EHT-GI 0), which means the
parameters as flags: 0x80, bw: 7, mcs: 13, eht_gi: 0, nss: 2,
this formula (result * rate->nss) will overflow and causes
the returned bitrate to be 3959 when it should be 57646.
Here is the explanation:
u64 tmp;
u32 result;
…
/* tmp = result = 4 * rates_996[0]
* = 4 * 480388888 = 0x72889c60
*/
tmp = result;
/* tmp = 0x72889c60 * 6144 = 0xabccea90000 */
tmp *= SCALE;
/* tmp = 0xabccea90000 / mcs_divisors[13]
* = 0xabccea90000 / 5120 = 0x8970bba6
*/
do_div(tmp, mcs_divisors[rate->mcs]);
/* result = 0x8970bba6 */
result = tmp;
/* normally (result * rate->nss) = 0x8970bba6 * 2 = 0x112e1774c,
* but since result is u32, (result * rate->nss) = 0x12e1774c,
* overflow happens and it loses the highest bit.
* Then result = 0x12e1774c / 8 = 39595753,
*/
result = (result * rate->nss) / 8;
Signed-off-by: Paul Zhang <quic_paulz@quicinc.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the function query_regdb_file() the alpha2 parameter is duplicated
using kmemdup() and subsequently freed in regdb_fw_cb(). However,
request_firmware_nowait() can fail without calling regdb_fw_cb() and
thus leak memory.
Fixes: 007f6c5e6e ("cfg80211: support loading regulatory database as firmware file")
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_register_hw free the allocated cipher suites when
registering wiphy fail, and ieee80211_free_hw will re-free it.
set wiphy_ciphers_allocated to false after freeing allocated
cipher suites.
Signed-off-by: taozhang <taozhang@bestechnic.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
All we're going to do with this pointer is assign it to
another __rcu pointer, but sparse can't see that, so
use rcu_access_pointer() to silence the warning here.
Fixes: c90b93b5b7 ("wifi: cfg80211: update hidden BSSes to avoid WARN_ON")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
syzbot found a crash in tipc_topsrv_accept:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
Workqueue: tipc_rcv tipc_topsrv_accept
RIP: 0010:kernel_accept+0x22d/0x350 net/socket.c:3487
Call Trace:
<TASK>
tipc_topsrv_accept+0x197/0x280 net/tipc/topsrv.c:460
process_one_work+0x991/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
It was caused by srv->listener that might be set to null by
tipc_topsrv_stop() in net .exit whereas it's still used in
tipc_topsrv_accept() worker.
srv->listener is protected by srv->idr_lock in tipc_topsrv_stop(), so add
a check for srv->listener under srv->idr_lock in tipc_topsrv_accept() to
avoid the null-ptr-deref. To ensure the lsock is not released during the
tipc_topsrv_accept(), move sock_release() after tipc_topsrv_work_stop()
where it's waiting until the tipc_topsrv_accept worker to be done.
Note that sk_callback_lock is used to protect sk->sk_user_data instead of
srv->listener, and it should check srv in tipc_topsrv_listener_data_ready()
instead. This also ensures that no more tipc_topsrv_accept worker will be
started after tipc_conn_close() is called in tipc_topsrv_stop() where it
sets sk->sk_user_data to null.
Fixes: 0ef897be12 ("tipc: separate topology server listener socket from subcsriber sockets")
Reported-by: syzbot+c5ce866a8d30f4be0651@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Link: https://lore.kernel.org/r/4eee264380c409c61c6451af1059b7fb271a7e7b.1666120790.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Missing flowi uid field in nft_fib expression, from Guillaume Nault.
This is broken since the creation of the fib expression.
2) Relax sanity check to fix bogus EINVAL error when deleting elements
belonging set intervals. Broken since 6.0-rc.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements
netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces.
====================
Link: https://lore.kernel.org/r/20221019065225.1006344-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently qdisc ingress handling (sch_handle_ingress()) doesn't
set a return value and it is left to the old return value of
the caller (__netif_receive_skb_core()) which is RX drop, so if
the packet is consumed, caller will stop and return this value
as if the packet was dropped.
This causes a problem in the kernel tcp stack when having a
egress tc rule forwarding to a ingress tc rule.
The tcp stack sending packets on the device having the egress rule
will see the packets as not successfully transmitted (although they
actually were), will not advance it's internal state of sent data,
and packets returning on such tcp stream will be dropped by the tcp
stack with reason ack-of-unsent-data. See reproduction in [0] below.
Fix that by setting the return value to RX success if
the packet was handled successfully.
[0] Reproduction steps:
$ ip link add veth1 type veth peer name peer1
$ ip link add veth2 type veth peer name peer2
$ ifconfig peer1 5.5.5.6/24 up
$ ip netns add ns0
$ ip link set dev peer2 netns ns0
$ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
$ ifconfig veth2 0 up
$ ifconfig veth1 0 up
#ingress forwarding veth1 <-> veth2
$ tc qdisc add dev veth2 ingress
$ tc qdisc add dev veth1 ingress
$ tc filter add dev veth2 ingress prio 1 proto all flower \
action mirred egress redirect dev veth1
$ tc filter add dev veth1 ingress prio 1 proto all flower \
action mirred egress redirect dev veth2
#steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
$ tc qdisc add dev peer1 clsact
$ tc filter add dev peer1 egress prio 20 proto ip flower \
action mirred ingress redirect dev veth1
#run iperf and see connection not running
$ iperf3 -s&
$ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
#delete egress rule, and run again, now should work
$ tc filter del dev peer1 egress
$ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
Fixes: f697c3e8b3 ("[NET]: Avoid unnecessary cloning for ingress filtering")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the default qdisc is sfb, if the qdisc of dev_queue fails to be
inited during mqprio_init(), sfb_reset() is invoked to clear resources.
In this case, the q->qdisc is NULL, and it will cause gpf issue.
The process is as follows:
qdisc_create_dflt()
sfb_init()
tcf_block_get() --->failed, q->qdisc is NULL
...
qdisc_put()
...
sfb_reset()
qdisc_reset(q->qdisc) --->q->qdisc is NULL
ops = qdisc->ops
The following is the Call Trace information:
general protection fault, probably for non-canonical address
0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
RIP: 0010:qdisc_reset+0x2b/0x6f0
Call Trace:
<TASK>
sfb_reset+0x37/0xd0
qdisc_reset+0xed/0x6f0
qdisc_destroy+0x82/0x4c0
qdisc_put+0x9e/0xb0
qdisc_create_dflt+0x2c3/0x4a0
mqprio_init+0xa71/0x1760
qdisc_create+0x3eb/0x1000
tc_modify_qdisc+0x408/0x1720
rtnetlink_rcv_msg+0x38e/0xac0
netlink_rcv_skb+0x12d/0x3a0
netlink_unicast+0x4a2/0x740
netlink_sendmsg+0x826/0xcc0
sock_sendmsg+0xc5/0x100
____sys_sendmsg+0x583/0x690
___sys_sendmsg+0xe8/0x160
__sys_sendmsg+0xbf/0x160
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f2164122d04
</TASK>
Fixes: e13e02a3c6 ("net_sched: SFB flow scheduler")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 494f5063b8.
When the default qdisc is fq_codel, if the qdisc of dev_queue fails to be
inited during mqprio_init(), fq_codel_reset() is invoked to clear
resources. In this case, the flow is NULL, and it will cause gpf issue.
The process is as follows:
qdisc_create_dflt()
fq_codel_init()
...
q->flows_cnt = 1024;
...
q->flows = kvcalloc(...) --->failed, q->flows is NULL
...
qdisc_put()
...
fq_codel_reset()
...
flow = q->flows + i --->q->flows is NULL
The following is the Call Trace information:
general protection fault, probably for non-canonical address
0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
RIP: 0010:fq_codel_reset+0x14d/0x350
Call Trace:
<TASK>
qdisc_reset+0xed/0x6f0
qdisc_destroy+0x82/0x4c0
qdisc_put+0x9e/0xb0
qdisc_create_dflt+0x2c3/0x4a0
mqprio_init+0xa71/0x1760
qdisc_create+0x3eb/0x1000
tc_modify_qdisc+0x408/0x1720
rtnetlink_rcv_msg+0x38e/0xac0
netlink_rcv_skb+0x12d/0x3a0
netlink_unicast+0x4a2/0x740
netlink_sendmsg+0x826/0xcc0
sock_sendmsg+0xc5/0x100
____sys_sendmsg+0x583/0x690
___sys_sendmsg+0xe8/0x160
__sys_sendmsg+0xbf/0x160
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fd272b22d04
</TASK>
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the default qdisc is cake, if the qdisc of dev_queue fails to be
inited during mqprio_init(), cake_reset() is invoked to clear
resources. In this case, the tins is NULL, and it will cause gpf issue.
The process is as follows:
qdisc_create_dflt()
cake_init()
q->tins = kvcalloc(...) --->failed, q->tins is NULL
...
qdisc_put()
...
cake_reset()
...
cake_dequeue_one()
b = &q->tins[...] --->q->tins is NULL
The following is the Call Trace information:
general protection fault, probably for non-canonical address
0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:cake_dequeue_one+0xc9/0x3c0
Call Trace:
<TASK>
cake_reset+0xb1/0x140
qdisc_reset+0xed/0x6f0
qdisc_destroy+0x82/0x4c0
qdisc_put+0x9e/0xb0
qdisc_create_dflt+0x2c3/0x4a0
mqprio_init+0xa71/0x1760
qdisc_create+0x3eb/0x1000
tc_modify_qdisc+0x408/0x1720
rtnetlink_rcv_msg+0x38e/0xac0
netlink_rcv_skb+0x12d/0x3a0
netlink_unicast+0x4a2/0x740
netlink_sendmsg+0x826/0xcc0
sock_sendmsg+0xc5/0x100
____sys_sendmsg+0x583/0x690
___sys_sendmsg+0xe8/0x160
__sys_sendmsg+0xbf/0x160
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f89e5122d04
</TASK>
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using GSO it can happen that the wrong seq_hi is used for the last
packets before the wrap around. This can lead to double usage of a
sequence number. To avoid this, we should serialize this last GSO
packet.
Fixes: d7dbefc45c ("xfrm: Add xfrm_replay_overflow functions for offloading")
Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Christian Langrock <christian.langrock@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Otherwise EINVAL is bogusly reported to userspace when deleting a set
element. NFTA_SET_ELEM_KEY_END does not need to be set in case of:
- insertion: if not present, start key is used as end key.
- deletion: only start key needs to be specified, end key is ignored.
Hence, relax the sanity check.
Fixes: 88cccd908d ("netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently netfilter's rpfilter and fib modules implicitely initialise
->flowic_uid with 0. This is normally the root UID. However, this isn't
the case in user namespaces, where user ID 0 is mapped to a different
kernel UID. By initialising ->flowic_uid with sock_net_uid(), we get
the root UID of the user namespace, thus keeping the same behaviour
whether or not we're running in a user namepspace.
Note, this is similar to commit 8bcfd0925e ("ipv4: add missing
initialization for flowi4_uid"), which fixed the rp_filter sysctl.
Fixes: 622ec2c9d5 ("net: core: add UID to flows, rules, and routes")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel
could select a unconnected socket wrongly for packets sent to the
connected socket.
However, the current way to set has_conns is illegal and possible to
trigger that problem. reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater. Then,
it must do the update under the updater's lock, reuseport_lock, but
it doesn't for now.
For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection. To avoid the race, let's split
the reader and updater with proper locking.
cpu1 cpu2
+----+ +----+
__ip[46]_datagram_connect() reuseport_grow()
. .
|- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size)
| . |
| |- rcu_read_lock()
| |- reuse = rcu_dereference(sk->sk_reuseport_cb)
| |
| | | /* reuse->has_conns == 0 here */
| | |- more_reuse->has_conns = reuse->has_conns
| |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */
| | |
| | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
| | | more_reuse)
| `- rcu_read_unlock() `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of
reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array
and update its sk->sk_reuseport_cb as NULL without lock_sock().
shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but, reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) never be false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe
A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et
qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM
R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM
lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg
MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M
8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS
XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr
Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv
vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR
4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE
lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc=
=M+mV
-----END PGP SIGNATURE-----
Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull more random number generator updates from Jason Donenfeld:
"This time with some large scale treewide cleanups.
The intent of this pull is to clean up the way callers fetch random
integers. The current rules for doing this right are:
- If you want a secure or an insecure random u64, use get_random_u64()
- If you want a secure or an insecure random u32, use get_random_u32()
The old function prandom_u32() has been deprecated for a while
now and is just a wrapper around get_random_u32(). Same for
get_random_int().
- If you want a secure or an insecure random u16, use get_random_u16()
- If you want a secure or an insecure random u8, use get_random_u8()
- If you want secure or insecure random bytes, use get_random_bytes().
The old function prandom_bytes() has been deprecated for a while
now and has long been a wrapper around get_random_bytes()
- If you want a non-uniform random u32, u16, or u8 bounded by a
certain open interval maximum, use prandom_u32_max()
I say "non-uniform", because it doesn't do any rejection sampling
or divisions. Hence, it stays within the prandom_*() namespace, not
the get_random_*() namespace.
I'm currently investigating a "uniform" function for 6.2. We'll see
what comes of that.
By applying these rules uniformly, we get several benefits:
- By using prandom_u32_max() with an upper-bound that the compiler
can prove at compile-time is ≤65536 or ≤256, internally
get_random_u16() or get_random_u8() is used, which wastes fewer
batched random bytes, and hence has higher throughput.
- By using prandom_u32_max() instead of %, when the upper-bound is
not a constant, division is still avoided, because
prandom_u32_max() uses a faster multiplication-based trick instead.
- By using get_random_u16() or get_random_u8() in cases where the
return value is intended to indeed be a u16 or a u8, we waste fewer
batched random bytes, and hence have higher throughput.
This series was originally done by hand while I was on an airplane
without Internet. Later, Kees and I worked on retroactively figuring
out what could be done with Coccinelle and what had to be done
manually, and then we split things up based on that.
So while this touches a lot of files, the actual amount of code that's
hand fiddled is comfortably small"
* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
prandom: remove unused functions
treewide: use get_random_bytes() when possible
treewide: use get_random_u32() when possible
treewide: use get_random_{u8,u16}() when possible, part 2
treewide: use get_random_{u8,u16}() when possible, part 1
treewide: use prandom_u32_max() when possible, part 2
treewide: use prandom_u32_max() when possible, part 1
Return zero if both dsa_slave_dev_check() and netdev_uses_dsa() are false.
Fixes: acc43b7bf5 ("net: dsa: allow masters to join a LAG")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If smc_wr_alloc_lgr_mem() fails then return an error code. Don't return
success.
Fixes: 8799e310fb ("net/smc: add v2 support to the work request layer")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Then the input contains '\0' or '\n', proc_mpc_write has read them,
so the return value needs +1.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xiaobo Liu <cppcoffee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TLS tries to get away with using the TCP input queue directly.
This does not work if there is duplicated data (multiple skbs
holding bytes for the same seq number range due to retransmits).
Check for this condition and fall back to copy mode, it should
be rare.
Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a 8-byte write to initialize sub.usr_handle in
tipc_topsrv_kern_subscr(), otherwise four bytes remain uninitialized
when issuing setsockopt(..., SOL_TIPC, ...).
This resulted in an infoleak reported by KMSAN when the packet was
received:
=====================================================
BUG: KMSAN: kernel-infoleak in copyout+0xbc/0x100 lib/iov_iter.c:169
instrument_copy_to_user ./include/linux/instrumented.h:121
copyout+0xbc/0x100 lib/iov_iter.c:169
_copy_to_iter+0x5c0/0x20a0 lib/iov_iter.c:527
copy_to_iter ./include/linux/uio.h:176
simple_copy_to_iter+0x64/0xa0 net/core/datagram.c:513
__skb_datagram_iter+0x123/0xdc0 net/core/datagram.c:419
skb_copy_datagram_iter+0x58/0x200 net/core/datagram.c:527
skb_copy_datagram_msg ./include/linux/skbuff.h:3903
packet_recvmsg+0x521/0x1e70 net/packet/af_packet.c:3469
____sys_recvmsg+0x2c4/0x810 net/socket.c:?
___sys_recvmsg+0x217/0x840 net/socket.c:2743
__sys_recvmsg net/socket.c:2773
__do_sys_recvmsg net/socket.c:2783
__se_sys_recvmsg net/socket.c:2780
__x64_sys_recvmsg+0x364/0x540 net/socket.c:2780
do_syscall_x64 arch/x86/entry/common.c:50
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd arch/x86/entry/entry_64.S:120
...
Uninit was stored to memory at:
tipc_sub_subscribe+0x42d/0xb50 net/tipc/subscr.c:156
tipc_conn_rcv_sub+0x246/0x620 net/tipc/topsrv.c:375
tipc_topsrv_kern_subscr+0x2e8/0x400 net/tipc/topsrv.c:579
tipc_group_create+0x4e7/0x7d0 net/tipc/group.c:190
tipc_sk_join+0x2a8/0x770 net/tipc/socket.c:3084
tipc_setsockopt+0xae5/0xe40 net/tipc/socket.c:3201
__sys_setsockopt+0x87f/0xdc0 net/socket.c:2252
__do_sys_setsockopt net/socket.c:2263
__se_sys_setsockopt net/socket.c:2260
__x64_sys_setsockopt+0xe0/0x160 net/socket.c:2260
do_syscall_x64 arch/x86/entry/common.c:50
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd arch/x86/entry/entry_64.S:120
Local variable sub created at:
tipc_topsrv_kern_subscr+0x57/0x400 net/tipc/topsrv.c:562
tipc_group_create+0x4e7/0x7d0 net/tipc/group.c:190
Bytes 84-87 of 88 are uninitialized
Memory access of size 88 starts at ffff88801ed57cd0
Data copied to user address 0000000020000400
...
=====================================================
Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: 026321c6d0 ("tipc: rename tipc_server to tipc_topsrv")
Signed-off-by: David S. Miller <davem@davemloft.net>
The trial period exists until jiffies is after addr_trial_end. But as
jiffies will eventually overflow, just using time_after will eventually
give incorrect results. As the node address is set once the trial period
ends, this can be used to know that we are not in the trial period.
Fixes: e415577f57 ("tipc: correct discovery message handling during address trial period")
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNIXOYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr17D/9qA35JJ4lnZWo+ayqwRPAqpO9DRgmyt2W+
FQAdbaWpsi2cr+Iv4AU51rIWfydP1UZCfYBPkWvLCoCzLizWUz4c6bLBc4+HvJcR
wuiWG+js6PSi279mUA1s1XXYthYt7CiMpi3+2iatGiPBFtIIMeMqMos02ohKu/4L
gd9qTKUfwepBcM2D61wk/jl7UXtdIb+NCh685FvkQ32Nk4kQx3L5y3BRJ1NUtCTz
GI7HT2Lp6Im3pD6hpDPSS8H/DK7gTdyQYPiI9+rfdfIR8HW4k/3FkefYB5He554J
8b9MsdSN6g/oIQxPhMlbbE4e24ahEPK7h5V1HYCF8i/dPbxjuZQ0HpgC+LiLG9/J
cOVXjZVjcuzUmJM9eJ/jNZx0w3/RHGo8lstBvQ0SyuRpLYhe+pvVYlkNEkK4ptIZ
bw/scjmQxNp3xK8PRUMaIb7sHfha/+OoIj+XHXgeiujtDIFQOc+7G14o+gCkker5
ZIVrbXcn4cWAHKUMPlH3NWj89v6lkulfuOQ6NZPbz4CYj3tudqnG3ah810iGBTiR
1HOhtwKPYR8WGioj1ZjUiSJZ/8aT4GUuLM2EjKsIITo09mYwx24G+pfaWz+qBnzw
SQqg0M0XVCg+RSrfakl9LE2uX2V06YDMBg52HA014CQaRE0pJHX/0u1VKg4rdD0z
QvdbB5peXg==
=0dSL
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.1-2022-10-13' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"A collection of fixes that ended up either being later than the
initial pull, or dependent on multiple branches (6.0-late being one of
them) and hence deferred purposely. This contains:
- Cleanup fixes for the single submitter late 6.0 change, which we
pushed to 6.1 to keep the 6.0 changes small (Dylan, Pavel)
- Fix for IORING_OP_CONNECT not handling -EINPROGRESS correctly (me)
- Ensure that the zc sendmsg variant gets audited correctly (me)
- Regression fix from this merge window where kiocb_end_write()
doesn't always gets called, which can cause issues with fs freezing
(me)
- Registered files SCM handling fix (Pavel)
- Regression fix for big sqe dumping in fdinfo (Pavel)
- Registered buffers accounting fix (Pavel)
- Remove leftover notification structures, we killed them off late in
6.0 (Pavel)
- Minor optimizations (Pavel)
- Cosmetic variable shadowing fix (Stefan)"
* tag 'io_uring-6.1-2022-10-13' of git://git.kernel.dk/linux:
io_uring/rw: ensure kiocb_end_write() is always called
io_uring: fix fdinfo sqe offsets calculation
io_uring: local variable rw shadows outer variable in io_write
io_uring/opdef: remove 'audit_skip' from SENDMSG_ZC
io_uring: optimise locking for local tw with submit_wait
io_uring: remove redundant memory barrier in io_req_local_work_add
io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT
io_uring: remove notif leftovers
io_uring: correct pinned_vm accounting
io_uring/af_unix: defer registered files gc to io_uring release
io_uring: limit registration w/ SINGLE_ISSUER
io_uring: remove io_register_submitter
io_uring: simplify __io_uring_add_tctx_node
Current release - regressions:
- Revert "net/sched: taprio: make qdisc_leaf() see
the per-netdev-queue pfifo child qdiscs", it may cause crashes
when the qdisc is reconfigured
- inet: ping: fix splat due to packet allocation refactoring in inet
- tcp: clean up kernel listener's reqsk in inet_twsk_purge(),
fix UAF due to races when per-netns hash table is used
Current release - new code bugs:
- eth: adin1110: check in netdev_event that netdev belongs to driver
- fixes for PTR_ERR() vs NULL bugs in driver code, from Dan and co.
Previous releases - regressions:
- ipv4: handle attempt to delete multipath route when fib_info
contains an nh reference, avoid oob access
- wifi: fix handful of bugs in the new Multi-BSSID code
- wifi: mt76: fix rate reporting / throughput regression on mt7915
and newer, fix checksum offload
- wifi: iwlwifi: mvm: fix double list_add at
iwl_mvm_mac_wake_tx_queue (other cases)
- wifi: mac80211: do not drop packets smaller than the LLC-SNAP
header on fast-rx
Previous releases - always broken:
- ieee802154: don't warn zero-sized raw_sendmsg()
- ipv6: ping: fix wrong checksum for large frames
- mctp: prevent double key removal and unref
- tcp/udp: fix memory leaks and races around IPV6_ADDRFORM
- hv_netvsc: fix race between VF offering and VF association message
Misc:
- remove -Warray-bounds silencing in the drivers, compilers fixed
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNISaMACgkQMUZtbf5S
IruEARAArjYZbOEGkUVqtcEbnV0vmxQ5GsVyvurDkmzUULJ1rVAITtG7BbxcPyZ7
tJf5BPmmpXxEXh/lZBIlgHLOGgf/cx4gkCH9Jz6LYlSoTpTZiTqxlOfAZNeei0FI
PD95Slvd3TnIOEysv5RH/pQzIoKdd6+YqOhVITbwCW36cCLaUm+r7JUhzDrnHMNE
KCcsOX9DDtW7MDJrJj/E0wlWeWcudpHY4DLG2A723X6Esu+8k6krK32XtkrFIKqa
PFxeU1NPgMkn4S2xRPKqy+W3dTMfMKB4WWBMMUzEU220MIxV4l/RZSrnI5nrnLh2
uXyUefpx+lD92D5BOiqUw8rK7B4Jq0uUrawuCf+70tbO1f13ThkkAlV6cEzrlnZY
tGQxs0ayFIDVypU1tpY9cemUiYXrnPpCkpz+V1G0us8L323eCHxjz/f5TUlb51Na
BVFvRqvxkjztprBv2LrH2SmnVtcH2kvQG8qMYmXRchBM+11rivz6BrPdE0V+muMg
Hjr6HefYMBpSgcD+ADVFr8a/OB/W7AuWpTBd3z/WyNQ5MxkFX9Kf2Lt2+j8SRfpE
ELO0AANFQZ1Gyp6LTbEkA3mFs1LhNNQyfjHcMHC16ZExHmV3i37BE9LJdnFM27N8
R8lIm4YDs6Jj6YIUDy2wExgUAgkUk7mfZNCMPNi2nSsdJksyAsc=
=AyqG
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, and wifi.
Current release - regressions:
- Revert "net/sched: taprio: make qdisc_leaf() see the
per-netdev-queue pfifo child qdiscs", it may cause crashes when the
qdisc is reconfigured
- inet: ping: fix splat due to packet allocation refactoring in inet
- tcp: clean up kernel listener's reqsk in inet_twsk_purge(), fix UAF
due to races when per-netns hash table is used
Current release - new code bugs:
- eth: adin1110: check in netdev_event that netdev belongs to driver
- fixes for PTR_ERR() vs NULL bugs in driver code, from Dan and co.
Previous releases - regressions:
- ipv4: handle attempt to delete multipath route when fib_info
contains an nh reference, avoid oob access
- wifi: fix handful of bugs in the new Multi-BSSID code
- wifi: mt76: fix rate reporting / throughput regression on mt7915
and newer, fix checksum offload
- wifi: iwlwifi: mvm: fix double list_add at
iwl_mvm_mac_wake_tx_queue (other cases)
- wifi: mac80211: do not drop packets smaller than the LLC-SNAP
header on fast-rx
Previous releases - always broken:
- ieee802154: don't warn zero-sized raw_sendmsg()
- ipv6: ping: fix wrong checksum for large frames
- mctp: prevent double key removal and unref
- tcp/udp: fix memory leaks and races around IPV6_ADDRFORM
- hv_netvsc: fix race between VF offering and VF association message
Misc:
- remove -Warray-bounds silencing in the drivers, compilers fixed"
* tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
sunhme: fix an IS_ERR() vs NULL check in probe
net: marvell: prestera: fix a couple NULL vs IS_ERR() checks
kcm: avoid potential race in kcm_tx_work
tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
net: phy: micrel: Fixes FIELD_GET assertion
openvswitch: add nf_ct_is_confirmed check before assigning the helper
tcp: Fix data races around icsk->icsk_af_ops.
ipv6: Fix data races around sk->sk_prot.
tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).
tcp/udp: Fix memory leak in ipv6_renew_options().
mctp: prevent double key removal and unref
selftests: netfilter: Fix nft_fib.sh for all.rp_filter=1
netfilter: rpfilter/fib: Populate flowic_l3mdev field
selftests: netfilter: Test reverse path filtering
net/mlx5: Make ASO poll CQ usable in atomic context
tcp: cdg: allow tcp_cdg_release() to be called multiple times
inet: ping: fix recent breakage
ipv6: ping: fix wrong checksum for large frames
net: ethernet: ti: am65-cpsw: set correct devlink flavour for unused ports
...
noteworthy one being some additional wakeups in cap handling code, and
a messenger cleanup.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmNINMwTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5UeB/0ZzxdhZarepRYdOw4K+hHmCWYjlrEi
Aw91gfS9DmzXfLyV42/6kxhKGEmVH4Wpz/mAIfMcLaLJZxI9GspVZZuofK6XPJUY
eGqllxXgbgvCqnX9puCfw4RTrEJkt/y6e0/6EhAhjArDNyEylHcApONbEsvHLB+L
IjYEJRuDDNqBnacMjn0iqI2F3zpyu6DkuJNWLxfbGhnWWsj8LaxXVgLtBeePuoIN
udVZiNxiJAldDGc99r0xX5gicjyihBRiomjnz6FO6F459CtrPE/qdx6TNUUt63N3
Lt55JDCM8qJeA8ffblZrhNnT2iefcEuqRcSwSdLbQxW/l6y23O4drx+N
=l/PT
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.1-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A quiet round this time: several assorted filesystem fixes, the most
noteworthy one being some additional wakeups in cap handling code, and
a messenger cleanup"
* tag 'ceph-for-6.1-rc1' of https://github.com/ceph/ceph-client:
ceph: remove Sage's git tree from documentation
ceph: fix incorrectly showing the .snap size for stat
ceph: fail the open_by_handle_at() if the dentry is being unlinked
ceph: increment i_version when doing a setattr with caps
ceph: Use kcalloc for allocating multiple elements
ceph: no need to wait for transition RDCACHE|RD -> RD
ceph: fail the request if the peer MDS doesn't support getvxattr op
ceph: wake up the waiters if any new caps comes
libceph: drop last_piece flag from ceph_msg_data_cursor
- New Features:
- Add NFSv4.2 xattr tracepoints
- Replace xprtiod WQ in rpcrdma
- Flexfiles cancels I/O on layout recall or revoke
- Bugfixes and Cleanups:
- Directly use ida_alloc() / ida_free()
- Don't open-code max_t()
- Prefer using strscpy over strlcpy
- Remove unused forward declarations
- Always return layout states on flexfiles layout return
- Have LISTXATTR treat NFS4ERR_NOXATTR as an empty reply instead of error
- Allow more xprtrdma memory allocations to fail without triggering a reclaim
- Various other xprtrdma clean ups
- Fix rpc_killall_tasks() races
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmNHJToACgkQ18tUv7Cl
QOtTbA//QiresBzf7cnZOAwiZbe9LXiWfR2p5IkBLJPYJ8xtTliRLwnwYgQib9OI
+4DzBiEqujah9BDac5OeatYW1UDLQ9lMIoCyvPjSw8Yxa8JEHDb/1ODDUOMS+ZIo
dk1AKV2Wi2stxn85Sy+VGriE3JKiaeJxAlsWgiT/BLP0hAyZw1L3Tg017EgxVIVz
8cfPBciu/Bc2/pZp9f5+GBjAlcUX0u/JFKiLPDHDZkvFTr4RgREZOyStDWncgsxK
iHAIfSr6TxlynHabNAnFNVuYq7gkBe3jg1TkABdQ+SilAgdLpugAW8MFdig0AZQO
UIsVJHjRHLpz6cJurnDcu9tGB6jLVTZfyz8PZQl5H9CqnbSHUxdOCTuve7fGhVas
+wSXq1U98gStzoqtw5pMwsB2YSSOsUR8QEZpLEkvQgzHwoszNa7FrELqaZUJyJHR
qmRH2nKCzsSBbQn5AhnzHBxzeOv6r0r3YjvKd5utwsRtq3g9GX14KAOmqvDTKk2q
9KmrGlDVtVmOww2QnPTXH6mSthHLuqcKg1H2H7Xymmskq9n8PC6M+EiQd8XsKNJa
MfBkOVFdxrJq6Htpx4IMLJP6jvYVKEbef2eRFt8hNnla8pMPlsDqoIysJulaWpiB
HqdoPHR9Y26Qxuw7G91ba5Q5qqu9+ZOLB9jeSRjcXtsUDxq9f/A=
=p47k
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add NFSv4.2 xattr tracepoints
- Replace xprtiod WQ in rpcrdma
- Flexfiles cancels I/O on layout recall or revoke
Bugfixes and Cleanups:
- Directly use ida_alloc() / ida_free()
- Don't open-code max_t()
- Prefer using strscpy over strlcpy
- Remove unused forward declarations
- Always return layout states on flexfiles layout return
- Have LISTXATTR treat NFS4ERR_NOXATTR as an empty reply instead of
error
- Allow more xprtrdma memory allocations to fail without triggering a
reclaim
- Various other xprtrdma clean ups
- Fix rpc_killall_tasks() races"
* tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked
SUNRPC: Add API to force the client to disconnect
SUNRPC: Add a helper to allow pNFS drivers to selectively cancel RPC calls
SUNRPC: Fix races with rpc_killall_tasks()
xprtrdma: Fix uninitialized variable
xprtrdma: Prevent memory allocations from driving a reclaim
xprtrdma: Memory allocation should be allowed to fail during connect
xprtrdma: MR-related memory allocation should be allowed to fail
xprtrdma: Clean up synopsis of rpcrdma_regbuf_alloc()
xprtrdma: Clean up synopsis of rpcrdma_req_create()
svcrdma: Clean up RPCRDMA_DEF_GFP
SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
NFSv4.2: Add a tracepoint for listxattr
NFSv4.2: Add tracepoints for getxattr, setxattr, and removexattr
NFSv4.2: Move TRACE_DEFINE_ENUM(NFS4_CONTENT_*) under CONFIG_NFS_V4_2
NFSv4.2: Add special handling for LISTXATTR receiving NFS4ERR_NOXATTR
nfs: remove nfs_wait_atomic_killable() and nfs_write_prepare() declaration
NFSv4: remove nfs4_renewd_prepare_shutdown() declaration
fs/nfs/pnfs_nfs.c: fix spelling typo and syntax error in comment
NFSv4/pNFS: Always return layout stats on layout return for flexfiles
...
Eric Dumazet reported a use-after-free related to the per-netns ehash
series. [0]
When we create a TCP socket from userspace, the socket always holds a
refcnt of the netns. This guarantees that a reqsk timer is always fired
before netns dismantle. Each reqsk has a refcnt of its listener, so the
listener is not freed before the reqsk, and the net is not freed before
the listener as well.
OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
of its netns. Thus, a reqsk timer can be fired after the netns dismantle
and access freed per-netns ehash.
To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
in inet_twsk_purge() if the netns uses a per-netns ehash.
[0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
include/net/inet_hashtables.h:181 [inline]
BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
net/ipv4/inet_connection_sock.c:913
Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
6.0.0-syzkaller-02757-gaf7d23f9d96a #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/22/2022
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
__run_timers kernel/time/timer.c:1768 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
__do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
invoke_softirq kernel/softirq.c:445 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
</IRQ>
Fixes: d1e5e6408b ("tcp: Introduce optional per-netns ehash.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221012145036.74960-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This has only the fixes for the scan parsing issues.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmNH4nUACgkQB8qZga/f
l8RFHRAAlskV3quBEnFgrOyuIcWVKq4DOFFKXXQ89cwD3ZK6auYm5gyM2obE+HkD
JQZjZoJc57qh3/IoYrOlC1PO0oAHiBReQ+5xKD0d1Ye04euFuV3Boe25rH3+7wya
u8uVJQqlV0sqG0DhoOPhDV/A17owtQlN/0j3yVFukPAvBjZHRduWF/lk+ozwmAsE
xwByw1K/v9BqOSx3aZjFhb3snT/4IiV8yYLiHE0KMBC+p5Zk6iwOWbCHFstWFbku
U6rTmWSvTDLaR+koTq4WGo/fRI68Sh4eRhcZsMnSRR8rj1LQWrRehPc+IQ9TXsgS
Li0z+HeLUEUrJ8bouJV0wzizvLG473qWm5PyemomKfYA68VvplheBpL3ClLIpbW+
On3VKnyZESIKsRYoJzJ3HhG7lnay74yKC1F4GMj4GCdzkXchSBoF7z0QGM14a7+E
gby4WHDSxK5kdC01vQqDCk3PBgGw+Qte6ipIvcVXnRVWCn8O0Z1lIdOkEZkODp5s
F1Sg9YpFRqR1h8Uw664xm/RulhPx2o8iritPjQyv0trmFWsdRpll/EXqIUPMiT0n
yJI+jD+1hDXnRt2dhTTtpG46F9oru+DbBWK+iN0GHISAlIUMXlKocIyCIKxBUUAq
0kzJrqJGPbXoodEDxvLMnAObVavo4RHux+jQVBlEkjHDpj0N7WA=
=yEqx
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
More wireless fixes for 6.1
This has only the fixes for the scan parsing issues.
* tag 'wireless-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: update hidden BSSes to avoid WARN_ON
wifi: mac80211: fix crash in beacon protection for P2P-device
wifi: mac80211_hwsim: avoid mac80211 warning on bad rate
wifi: cfg80211: avoid nontransmitted BSS list corruption
wifi: cfg80211: fix BSS refcounting bugs
wifi: cfg80211: ensure length byte is present before access
wifi: mac80211: fix MBSSID parsing use-after-free
wifi: cfg80211/mac80211: reject bad MBSSID elements
wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
====================
Link: https://lore.kernel.org/r/20221013100522.46346-1-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A WARN_ON call trace would be triggered when 'ct(commit, alg=helper)'
applies on a confirmed connection:
WARNING: CPU: 0 PID: 1251 at net/netfilter/nf_conntrack_extend.c:98
RIP: 0010:nf_ct_ext_add+0x12d/0x150 [nf_conntrack]
Call Trace:
<TASK>
nf_ct_helper_ext_add+0x12/0x60 [nf_conntrack]
__nf_ct_try_assign_helper+0xc4/0x160 [nf_conntrack]
__ovs_ct_lookup+0x72e/0x780 [openvswitch]
ovs_ct_execute+0x1d8/0x920 [openvswitch]
do_execute_actions+0x4e6/0xb60 [openvswitch]
ovs_execute_actions+0x60/0x140 [openvswitch]
ovs_packet_cmd_execute+0x2ad/0x310 [openvswitch]
genl_family_rcv_msg_doit.isra.15+0x113/0x150
genl_rcv_msg+0xef/0x1f0
which can be reproduced with these OVS flows:
table=0, in_port=veth1,tcp,tcp_dst=2121,ct_state=-trk
actions=ct(commit, table=1)
table=1, in_port=veth1,tcp,tcp_dst=2121,ct_state=+trk+new
actions=ct(commit, alg=ftp),normal
The issue was introduced by commit 248d45f1e1 ("openvswitch: Allow
attaching helper in later commit") where it somehow removed the check
of nf_ct_is_confirmed before asigning the helper. This patch is to fix
it by bringing it back.
Fixes: 248d45f1e1 ("openvswitch: Allow attaching helper in later commit")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Tested-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/c5c9092a22a2194650222bffaf786902613deb16.1665085502.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
under lock_sock(), but tcp_(get|set)sockopt() read it locklessly. To
avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
for the reads and writes.
Thanks to Eric Dumazet for providing the syzbot report:
BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect
write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
__inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
__sys_connect_file net/socket.c:1976 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1993
__do_sys_connect net/socket.c:2003 [inline]
__se_sys_connect net/socket.c:2000 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:2000
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
__sys_setsockopt+0x212/0x2b0 net/socket.c:2252
__do_sys_setsockopt net/socket.c:2263 [inline]
__se_sys_setsockopt net/socket.c:2260 [inline]
__x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
6.0.0-rc4-syzkaller-00331-g4ed9c1e971b1-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 08/26/2022
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 086d49058c ("ipv6: annotate some data-races around sk->sk_prot")
fixed some data-races around sk->sk_prot but it was not enough.
Some functions in inet6_(stream|dgram)_ops still access sk->sk_prot
without lock_sock() or rtnl_lock(), so they need READ_ONCE() to avoid
load tearing.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were
able to clean them up by calling inet6_destroy_sock() during the IPv6 ->
IPv4 conversion by IPV6_ADDRFORM. However, commit 03485f2adc ("udpv6:
Add lockless sendmsg() support") added a lockless memory allocation path,
which could cause a memory leak:
setsockopt(IPV6_ADDRFORM) sendmsg()
+-----------------------+ +-------+
- do_ipv6_setsockopt(sk, ...) - udpv6_sendmsg(sk, ...)
- sockopt_lock_sock(sk) ^._ called via udpv6_prot
- lock_sock(sk) before WRITE_ONCE()
- WRITE_ONCE(sk->sk_prot, &tcp_prot)
- inet6_destroy_sock() - if (!corkreq)
- sockopt_release_sock(sk) - ip6_make_skb(sk, ...)
- release_sock(sk) ^._ lockless fast path for
the non-corking case
- __ip6_append_data(sk, ...)
- ipv6_local_rxpmtu(sk, ...)
- xchg(&np->rxpmtu, skb)
^._ rxpmtu is never freed.
- goto out_no_dst;
- lock_sock(sk)
For now, rxpmtu is only the case, but not to miss the future change
and a similar bug fixed in commit e27326009a ("net: ping6: Fix
memleak in ipv6_renew_options()."), let's set a new function to IPv6
sk->sk_destruct() and call inet6_cleanup_sock() there. Since the
conversion does not change sk->sk_destruct(), we can guarantee that
we can clean up IPv6 resources finally.
We can now remove all inet6_destroy_sock() calls from IPv6 protocol
specific ->destroy() functions, but such changes are invasive to
backport. So they can be posted as a follow-up later for net-next.
Fixes: 03485f2adc ("udpv6: Add lockless sendmsg() support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support") forgot
to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is
changed to udp_prot and ->destroy() never cleans it up, resulting in
a memory leak.
This is due to the discrepancy between inet6_destroy_sock() and
IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
to remove the difference.
However, this is not enough for now because rxpmtu can be changed
without lock_sock() after commit 03485f2adc ("udpv6: Add lockless
sendmsg() support"). We will fix this case in the following patch.
Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
in the future.
Fixes: 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().
Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, we have a bug where a simultaneous DROPTAG ioctl and socket
close may race, as we attempt to remove a key from lists twice, and
perform an unref for each removal operation. This may result in a uaf
when we attempt the second unref.
This change fixes the race by making __mctp_key_remove tolerant to being
called on a key that has already been removed from the socket/net lists,
and only performs the unref when we do the actual remove. We also need
to hold the list lock on the ioctl cleanup path.
This fix is based on a bug report and comprehensive analysis from
butt3rflyh4ck <butterflyhuangxx@gmail.com>, found via syzkaller.
Cc: stable@vger.kernel.org
Fixes: 63ed1aab3d ("mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the introduced field for correct operation with VRF devices instead
of conditionally overwriting flowic_oif. This is a partial revert of
commit b575b24b8e ("netfilter: Fix rpfilter dropping vrf packets by
mistake"), implementing a simpler solution.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Ido reported that a kernel warning [1] can be triggered from
user space when the kernel is compiled with CONFIG_MODULES=y and
CONFIG_XFRM=n when adding an xfrm encap type route, e.g:
$ ip route add 198.51.100.0/24 dev dummy1 encap xfrm if_id 1
Error: lwt encapsulation type not supported.
The reason for the warning is that the LWT infrastructure has an
autoloading feature which is meant only for encap types that don't
use a net device, which is not the case in xfrm encap.
Mute this warning for xfrm encap as there's no encap module to autoload
in this case.
[1]
WARNING: CPU: 3 PID: 2746262 at net/core/lwtunnel.c:57 lwtunnel_valid_encap_type+0x4f/0x120
[...]
Call Trace:
<TASK>
rtm_to_fib_config+0x211/0x350
inet_rtm_newroute+0x3a/0xa0
rtnetlink_rcv_msg+0x154/0x3c0
netlink_rcv_skb+0x49/0xf0
netlink_unicast+0x22f/0x350
netlink_sendmsg+0x208/0x440
____sys_sendmsg+0x21f/0x250
___sys_sendmsg+0x83/0xd0
__sys_sendmsg+0x54/0xa0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 2c2493b9da ("xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md mode")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The commit in the "Fixes" tag tried to avoid a case where policy check
is ignored due to dst caching in next hops.
However, when the traffic is locally consumed, the dst may be cached
in a local TCP or UDP socket as part of early demux. In this case the
"disable_policy" flag is not checked as ip_route_input_noref() was only
called before caching, and thus, packets after the initial packet in a
flow will be dropped if not matching policies.
Fix by checking the "disable_policy" flag also when a valid dst is
already available.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216557
Reported-by: Monil Patel <monil191989@gmail.com>
Fixes: e6175a2ed1 ("xfrm: fix "disable_policy" flag use when arriving from different devices")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
----
v2: use dev instead of skb->dev
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
For a given ping datagram, ping_getfrag() is called once
per skb fragment.
A large datagram requiring more than one page fragment
is currently getting the checksum of the last fragment,
instead of the cumulative one.
After this patch, "ping -s 35000 ::1" is working correctly.
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First set of fixes for v6.1. Quite a lot of fixes in stack but also
for mt76.
cfg80211/mac80211
* fix locking error in mac80211's hw addr change
* fix TX queue stop for internal TXQs
* handling of very small (e.g. STP TCN) packets
* two memcpy() hardening fixes
* fix probe request 6 GHz capability warning
* fix various connection prints
* fix decapsulation offload for AP VLAN
mt76
* fix rate reporting, LLC packets and receive checksum offload on specific chipsets
iwlwifi
* fix crash due to list corruption
ath11k
* fix a compiler warning with GCC 11 and KASAN
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmNFmekRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZvdswf/ZMyK5fTQBzOoAna360T8YIHqTaWtpLxt
Uf4OrB0SlzHgEFPBdPqJHarVCoCaRvj31Y7iFc6bHtKpVgra170HQebbaKZWbLZ1
pKlL75mrwH2K+A+wJDEPnpyAXB1hCmswDApOfLc3RSLnFd3h2muLsJeAEny6Z1Ua
62Sz/IqbdpgTHeh0kJqOTNw50GBiUOAYQhnNeHTjoc4b76dRViAptGLZE1AueoPM
mv8hIkMDgmoZA10sIg2cZJcTARfvLfh534rtZ2r0oml3d9TX2fpjVP6N+UQIEPgr
lAkltywDtKWpgUCVsNTN82kRpVGa2QPUq8Ey/yTsNBqu9iuHSlgs3w==
=b5ut
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.1
First set of fixes for v6.1. Quite a lot of fixes in stack but also
for mt76.
cfg80211/mac80211
- fix locking error in mac80211's hw addr change
- fix TX queue stop for internal TXQs
- handling of very small (e.g. STP TCN) packets
- two memcpy() hardening fixes
- fix probe request 6 GHz capability warning
- fix various connection prints
- fix decapsulation offload for AP VLAN
mt76
- fix rate reporting, LLC packets and receive checksum offload on specific chipsets
iwlwifi
- fix crash due to list corruption
ath11k
- fix a compiler warning with GCC 11 and KASAN
* tag 'wireless-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: ath11k: mac: fix reading 16 bytes from a region of size 0 warning
wifi: iwlwifi: mvm: fix double list_add at iwl_mvm_mac_wake_tx_queue (other cases)
wifi: mt76: fix rx checksum offload on mt7615/mt7915/mt7921
wifi: mt76: fix receiving LLC packets on mt7615/mt7915
wifi: nl80211: Split memcpy() of struct nl80211_wowlan_tcp_data_token flexible array
wifi: wext: use flex array destination for memcpy()
wifi: cfg80211: fix ieee80211_data_to_8023_exthdr handling of small packets
wifi: mac80211: netdev compatible TX stop for iTXQ drivers
wifi: mac80211: fix decap offload for stations on AP_VLAN interfaces
wifi: mac80211: unlock on error in ieee80211_can_powered_addr_change()
wifi: mac80211: remove/avoid misleading prints
wifi: mac80211: fix probe req HE capabilities access
wifi: mac80211: do not drop packets smaller than the LLC-SNAP header on fast-rx
wifi: mt76: fix rate reporting / throughput regression on mt7915 and newer
====================
Link: https://lore.kernel.org/r/20221011163123.A093CC433D6@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The prandom_bytes() function has been a deprecated inline wrapper around
get_random_bytes() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. This was done as a basic find and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
simply use the get_random_{u8,u16}() functions, which are faster than
wasting the additional bytes from a 32-bit value. This was done by hand,
identifying all of the places where one of the random integer functions
was used in a non-32-bit context.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The highlight of this PR is Christian's patch to allocate smaller buffers
for most metadata requests: 9p with a big msize would try to allocate large
buffers when just 4 or 8k would be more than enough; this brings in nice
performance improvements.
There's also a few fixes for problems reported by syzkaller (thanks to
Schspa Shi, Tetsuo Handa for tests and feedback/patches) as well as some
minor cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmNDq18ACgkQq06b7GqY
5nDQhBAAsl1C2gOvnCl2OA8HRf4EFWeal6wvG61y7yWaYos+oowu3TNw2wE6XsSI
f2aqbngmvQsVt/iosvVjCrwOyeKdNUaMoKcrqAZ6IFC5YzR+Xu1dvpWsTff0lA7T
nq6dj6W1F/KoKpV5RwWJM1lWWt/hHi2+pzH8HsAZUaNfD3tgH6ku4Qm1xM9dP5Ma
k2ckEIx7U6NeqAte816K5HA1aat2wFyC0PlMXTx1bOEYNRk9c5kGWY4DcAzA4vF9
x9PoqFfZow7UZfa0mo3cAk39Z8n6E4y93Eu+Mw3JCSfEQy3qB2wnzlLt2xMqiCtU
Pi2RrEd+doqRw/kvyEUCCl9ycVlb8iuAAXsUGuZFHYCMMlwXN/D1EY8x+b+XCwbE
TH8yUMMhgLfVPhVtFDBDTXgPDZbXAchb9BOoM6np4tOUdEe41kURd2wgdYhCP6+e
9BerKTXRqukyj/lf+4fp7XV1Ctj8Fi6FO4pH++rOXoiX765UFZhUiJd/vSE5Zou+
ANCSE8KWDXZwLS1sq1pjjHQOGyzHO76PUYDZL/JV8yu/NwR7Jwu1gbbABEb7sGfQ
Jxg7KuaJvQMrYcbguKBDhTpSariPyIq7bvUImh2jrJB23WbA+/L/1vo/5XOBEtZs
SGitcbyhvjKAB5XYLMaZhTw8eKtoTY6xTORyEnYpz2xXcM84Mmw=
=70yS
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Smaller buffers for small messages and fixes.
The highlight of this is Christian's patch to allocate smaller buffers
for most metadata requests: 9p with a big msize would try to allocate
large buffers when just 4 or 8k would be more than enough; this brings
in nice performance improvements.
There's also a few fixes for problems reported by syzkaller (thanks to
Schspa Shi, Tetsuo Handa for tests and feedback/patches) as well as
some minor cleanup"
* tag '9p-for-6.1' of https://github.com/martinetd/linux:
net/9p: clarify trans_fd parse_opt failure handling
net/9p: add __init/__exit annotations to module init/exit funcs
net/9p: use a dedicated spinlock for trans_fd
9p/trans_fd: always use O_NONBLOCK read/write
net/9p: allocate appropriate reduced message buffers
net/9p: add 'pooled_rbuffers' flag to struct p9_trans_module
net/9p: add p9_msg_buf_size()
9p: add P9_ERRMAX for 9p2000 and 9p2000.u
net/9p: split message size argument into 't_size' and 'r_size' pair
9p: trans_fd/p9_conn_cancel: drop client lock earlier
* cpuset now support isolated cpus.partition type, which will enable dynamic
CPU isolation.
* pids.peak added to remember the max number of pids used.
* Holes in cgroup namespace plugged.
* Internal cleanups.
Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
follow-up cleanups and each merge commit has details.
Also, 8a693f7766 ("cgroup: Remove CFTYPE_PRESSURE") removes the flag used
by PSI changes in the tip tree and the merged result won't compile due to
the missing flag. Simply removing the struct init lines specifying the flag
is the correct resolution. linux-next already contains the correct fix:
https://lkml.kernel.org/r/20220912161812.072aaa3b@canb.auug.org.au
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYzsl7w4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGYsxAP4kad4YPw+CueLyyEMiYgBHouqDt8cG0+FJWK3X
svTC7wD/eCLfxZM8TjjSrMmvaMrml586mr3NoQaFeW0x3twptQQ=
=LERu
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset now support isolated cpus.partition type, which will enable
dynamic CPU isolation
- pids.peak added to remember the max number of pids used
- holes in cgroup namespace plugged
- internal cleanups
* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
cgroup: use strscpy() is more robust and safer
iocost_monitor: reorder BlkgIterator
cgroup: simplify code in cgroup_apply_control
cgroup: Make cgroup_get_from_id() prettier
cgroup/cpuset: remove unreachable code
cgroup: Remove CFTYPE_PRESSURE
cgroup: Improve cftype add/rm error handling
kselftest/cgroup: Add cpuset v2 partition root state test
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
cgroup/cpuset: Relocate a code block in validate_change()
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Relax constraints to partition & cpus changes
cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
cgroup/cpuset: Miscellaneous cleanups & add helper functions
cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
cgroup: add pids.peak interface for pids controller
cgroup: Remove data-race around cgrp_dfl_visible
cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
...
- Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
- Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
- Freezer:
- Rewrite the core freezer to behave better wrt thawing and be simpler
in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
all the fallout.
- Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
- Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
TeSLCxQxXlU=
=KjMD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
Freezer:
- Rewrite the core freezer to behave better wrt thawing and be
simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
fixing/adjusting all the fallout.
Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() &
replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications"
* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched: Fix more TASK_state comparisons
sched: Fix TASK_state comparisons
sched/fair: Move call to list_last_entry() in detach_tasks
sched/fair: Cleanup loop_max and loop_break
sched/fair: Make sure to try to detach at least one movable task
sched: Show PF_flag holes
freezer,sched: Rewrite core freezer logic
sched: Widen TAKS_state literals
sched/wait: Add wait_event_state()
sched/completion: Add wait_for_completion_state()
sched: Add TASK_ANY for wait_task_inactive()
sched: Change wait_task_inactive()s match_state
freezer,umh: Clean up freezer/initrd interaction
freezer: Have {,un}lock_system_sleep() save/restore flags
sched: Rename task_running() to task_on_cpu()
sched/fair: Cleanup for SIS_PROP
sched/fair: Default to false in test_idle_cores()
sched/fair: Remove useless check in select_idle_core()
sched/fair: Avoid double search on same cpu
sched/fair: Remove redundant check in select_idle_smt()
...
When updating beacon elements in a non-transmitted BSS,
also update the hidden sub-entries to the same beacon
elements, so that a future update through other paths
won't trigger a WARN_ON().
The warning is triggered because the beacon elements in
the hidden BSSes that are children of the BSS should
always be the same as in the parent.
Reported-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Tested-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If beacon protection is active but the beacon cannot be
decrypted or is otherwise malformed, we call the cfg80211
API to report this to userspace, but that uses a netdev
pointer, which isn't present for P2P-Device. Fix this to
call it only conditionally to ensure cfg80211 won't crash
in the case of P2P-Device.
This fixes CVE-2022-42722.
Reported-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 9eaf183af7 ("mac80211: Report beacon protection failures to user space")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If a non-transmitted BSS shares enough information (both
SSID and BSSID!) with another non-transmitted BSS of a
different AP, then we can find and update it, and then
try to add it to the non-transmitted BSS list. We do a
search for it on the transmitted BSS, but if it's not
there (but belongs to another transmitted BSS), the list
gets corrupted.
Since this is an erroneous situation, simply fail the
list insertion in this case and free the non-transmitted
BSS.
This fixes CVE-2022-42721.
Reported-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Tested-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are multiple refcounting bugs related to multi-BSSID:
- In bss_ref_get(), if the BSS has a hidden_beacon_bss, then
the bss pointer is overwritten before checking for the
transmitted BSS, which is clearly wrong. Fix this by using
the bss_from_pub() macro.
- In cfg80211_bss_update() we copy the transmitted_bss pointer
from tmp into new, but then if we release new, we'll unref
it erroneously. We already set the pointer and ref it, but
need to NULL it since it was copied from the tmp data.
- In cfg80211_inform_single_bss_data(), if adding to the non-
transmitted list fails, we unlink the BSS and yet still we
return it, but this results in returning an entry without
a reference. We shouldn't return it anyway if it was broken
enough to not get added there.
This fixes CVE-2022-42720.
Reported-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Tested-by: Sönke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: a3584f56de ("cfg80211: Properly track transmitting and non-transmitting BSS")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When iterating the elements here, ensure the length byte is
present before checking it to see if the entire element will
fit into the buffer.
Longer term, we should rewrite this code using the type-safe
element iteration macros that check all of this.
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Reported-by: Soenke Huster <shuster@seemoo.tu-darmstadt.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we parse a multi-BSSID element, we might point some
element pointers into the allocated nontransmitted_profile.
However, we free this before returning, causing UAF when the
relevant pointers in the parsed elements are accessed.
Fix this by not allocating the scratch buffer separately but
as part of the returned structure instead, that way, there
are no lifetime issues with it.
The scratch buffer introduction as part of the returned data
here is taken from MLO feature work done by Ilan.
This fixes CVE-2022-42719.
Fixes: 5023b14cf4 ("mac80211: support profile split between elements")
Co-developed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Per spec, the maximum value for the MaxBSSID ('n') indicator is 8,
and the minimum is 1 since a multiple BSSID set with just one BSSID
doesn't make sense (the # of BSSIDs is limited by 2^n).
Limit this in the parsing in both cfg80211 and mac80211, rejecting
any elements with an invalid value.
This fixes potentially bad shifts in the processing of these inside
the cfg80211_gen_new_bssid() function later.
I found this during the investigation of CVE-2022-41674 fixed by the
previous patch.
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Fixes: 78ac51f815 ("mac80211: support multi-bssid")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the copy code of the elements, we do the following calculation
to reach the end of the MBSSID element:
/* copy the IEs after MBSSID */
cpy_len = mbssid[1] + 2;
This looks fine, however, cpy_len is a u8, the same as mbssid[1],
so the addition of two can overflow. In this case the subsequent
memcpy() will overflow the allocated buffer, since it copies 256
bytes too much due to the way the allocation and memcpy() sizes
are calculated.
Fix this by using size_t for the cpy_len variable.
This fixes CVE-2022-41674.
Reported-by: Soenke Huster <shuster@seemoo.tu-darmstadt.de>
Tested-by: Soenke Huster <shuster@seemoo.tu-darmstadt.de>
Fixes: 0b8fb8235b ("cfg80211: Parsing of Multiple BSSID information in scanning")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix wrong pointer passed to PTR_ERR() in dsa_port_phylink_create() to print
error message.
Fixes: cf5ca4ddc3 ("net: dsa: don't leave dangling pointers in dp->pl when failing")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big set of driver core and debug printk changes for 6.1-rc1.
Included in here is:
- dynamic debug updates for the core and the drm subsystem. The
drm changes have all been acked by the relevant maintainers.
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they
were not being used and they really did not actually do
anything.)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BYUA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylozwCdFRlcghaf7XBUyNgRZRwMC+oQI8EAn1G/nEDE
6aFd2er41uK0IGQnSmYO
=OK0k
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core and debug printk changes for
6.1-rc1. Included in here is:
- dynamic debug updates for the core and the drm subsystem. The drm
changes have all been acked by the relevant maintainers
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they were
not being used and they really did not actually do anything)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
docs: filesystems: sysfs: Make text and code for ->show() consistent
Documentation: NBD_REQUEST_MAGIC isn't a magic number
a.out: restore CMAGIC
device property: Add const qualifier to device_get_match_data() parameter
drm_print: add _ddebug descriptor to drm_*dbg prototypes
drm_print: prefer bare printk KERN_DEBUG on generic fn
drm_print: optimize drm_debug_enabled for jump-label
drm-print: add drm_dbg_driver to improve namespace symmetry
drm-print.h: include dyndbg header
drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
drm_print: interpose drm_*dbg with forwarding macros
drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
drm_print: condense enum drm_debug_category
debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
Documentation: ENI155_MAGIC isn't a magic number
Documentation: NBD_REPLY_MAGIC isn't a magic number
nbd: remove define-only NBD_MAGIC, previously magic number
Documentation: FW_HEADER_MAGIC isn't a magic number
Documentation: EEPROM_MAGIC_VALUE isn't a magic number
...
Here is the big set of TTY and Serial driver updates for 6.1-rc1.
Lots of cleanups in here, no real new functionality this time around,
with the diffstat being that we removed more lines than we added!
Included in here are:
- termios unification cleanups from Al Viro, it's nice to
finally get this work done
- tty serial transmit cleanups in various drivers in preparation
for more cleanup and unification in future releases (that work
was not ready for this release.)
- n_gsm fixes and updates
- ktermios cleanups and code reductions
- dt bindings json conversions and updates for new devices
- some serial driver updates for new devices
- lots of other tiny cleanups and janitorial stuff. Full
details in the shortlog.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BSdA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylucQCfaXIrYuh2AHcb6+G+Nqp1xD2BYaEAoIdLyOCA
a2yziLrDF6us2oav6j4x
=Wv+X
-----END PGP SIGNATURE-----
Merge tag 'tty-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big set of TTY and Serial driver updates for 6.1-rc1.
Lots of cleanups in here, no real new functionality this time around,
with the diffstat being that we removed more lines than we added!
Included in here are:
- termios unification cleanups from Al Viro, it's nice to finally get
this work done
- tty serial transmit cleanups in various drivers in preparation for
more cleanup and unification in future releases (that work was not
ready for this release)
- n_gsm fixes and updates
- ktermios cleanups and code reductions
- dt bindings json conversions and updates for new devices
- some serial driver updates for new devices
- lots of other tiny cleanups and janitorial stuff. Full details in
the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (102 commits)
serial: cpm_uart: Don't request IRQ too early for console port
tty: serial: do unlock on a common path in altera_jtaguart_console_putc()
tty: serial: unify TX space reads under altera_jtaguart_tx_space()
tty: serial: use FIELD_GET() in lqasc_tx_ready()
tty: serial: extend lqasc_tx_ready() to lqasc_console_putchar()
tty: serial: allow pxa.c to be COMPILE_TESTed
serial: stm32: Fix unused-variable warning
tty: serial: atmel: Add COMMON_CLK dependency to SERIAL_ATMEL
serial: 8250: Fix restoring termios speed after suspend
serial: Deassert Transmit Enable on probe in driver-specific way
serial: 8250_dma: Convert to use uart_xmit_advance()
serial: 8250_omap: Convert to use uart_xmit_advance()
MAINTAINERS: Solve warning regarding inexistent atmel-usart binding
serial: stm32: Deassert Transmit Enable on ->rs485_config()
serial: ar933x: Deassert Transmit Enable on ->rs485_config()
tty: serial: atmel: Use FIELD_PREP/FIELD_GET
tty: serial: atmel: Make the driver aware of the existence of GCLK
tty: serial: atmel: Only divide Clock Divisor if the IP is USART
tty: serial: atmel: Separate mode clearing between UART and USART
dt-bindings: serial: atmel,at91-usart: Add gclk as a possible USART clock
...
To work around a misbehavior of the compiler's ability to see into
composite flexible array structs (as detailed in the coming memcpy()
hardening series[1]), split the memcpy() of the header and the payload
so no false positive run-time overflow warning will be generated.
[1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Syzkaller reports buffer overflow false positive as follows:
------------[ cut here ]------------
memcpy: detected field-spanning write (size 8) of single field
"&compat_event->pointer" at net/wireless/wext-core.c:623 (size 4)
WARNING: CPU: 0 PID: 3607 at net/wireless/wext-core.c:623
wireless_send_event+0xab5/0xca0 net/wireless/wext-core.c:623
Modules linked in:
CPU: 1 PID: 3607 Comm: syz-executor659 Not tainted
6.0.0-rc6-next-20220921-syzkaller #0
[...]
Call Trace:
<TASK>
ioctl_standard_call+0x155/0x1f0 net/wireless/wext-core.c:1022
wireless_process_ioctl+0xc8/0x4c0 net/wireless/wext-core.c:955
wext_ioctl_dispatch net/wireless/wext-core.c:988 [inline]
wext_ioctl_dispatch net/wireless/wext-core.c:976 [inline]
wext_handle_ioctl+0x26b/0x280 net/wireless/wext-core.c:1049
sock_ioctl+0x285/0x640 net/socket.c:1220
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]
</TASK>
Wireless events will be sent on the appropriate channels in
wireless_send_event(). Different wireless events may have different
payload structure and size, so kernel uses **len** and **cmd** field
in struct __compat_iw_event as wireless event common LCP part, uses
**pointer** as a label to mark the position of remaining different part.
Yet the problem is that, **pointer** is a compat_caddr_t type, which may
be smaller than the relative structure at the same position. So during
wireless_send_event() tries to parse the wireless events payload, it may
trigger the memcpy() run-time destination buffer bounds checking when the
relative structure's data is copied to the position marked by **pointer**.
This patch solves it by introducing flexible-array field **ptr_bytes**,
to mark the position of the wireless events remaining part next to
LCP part. What's more, this patch also adds **ptr_len** variable in
wireless_send_event() to improve its maintainability.
Reported-and-tested-by: syzbot+473754e5af963cf014cf@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/00000000000070db2005e95a5984@google.com/
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
STP topology change notification packets only have a payload of 7 bytes,
so they get dropped due to the skb->len < hdrlen + 8 check.
Fix this by removing the extra 8 from the skb->len check and checking the
return code on the skb_copy_bits calls.
Fixes: 2d1c304cb2 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Reported-by: Chad Monroe <chad.monroe@smartrg.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Properly handle TX stop for internal queues (iTXQs) within mac80211.
mac80211 must not stop netdev queues when using mac80211 iTXQs.
For these drivers the netdev interface is created with IFF_NO_QUEUE.
While netdev still drops frames for IFF_NO_QUEUE interfaces when we stop
the netdev queues, it also prints a warning when this happens:
Assuming the mac80211 interface is called wlan0 we would get
"Virtual device wlan0 asks to queue packet!" when netdev has to drop a
frame.
This patch is keeping the harmless netdev queue starts for iTXQ drivers.
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since AP_VLAN interfaces are not passed to the driver, check offload_flags
on the bss vif instead.
Reported-by: Howard Hsu <howard-yh.hsu@mediatek.com>
Fixes: 80a915ec44 ("mac80211: add rx decapsulation offload support")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Unlock before returning -EOPNOTSUPP.
Fixes: 3c06e91b40 ("wifi: mac80211: Support POWERED_ADDR_CHANGE feature")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At some point a few kernel debug prints started appearing which
indicated something was sending invalid IEs:
"bad VHT capabilities, disabling VHT"
"Invalid HE elem, Disable HE"
Turns out these were being printed because the local hardware
supported HE/VHT but the peer/AP did not. Bad/invalid indicates,
to me at least, that the IE is in some way malformed, not missing.
For the HE print (ieee80211_verify_peer_he_mcs_support) it will
now silently fail if the HE capability element is missing (still
prints if the element size is wrong).
For the VHT print, it has been removed completely and will silently
set the DISABLE_VHT flag which is consistent with how DISABLE_HT
is set.
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When building the probe request IEs HE support is checked for
the 6GHz band (wiphy->bands[NL80211_BAND_6GHZ]). If supported
the HE capability IE should be included according to the spec.
The problem is the 16-bit capability is obtained from the
band object (sband) that was passed in, not the 6GHz band
object (sband6). If the sband object doesn't support HE it will
result in a warning.
Fixes: 7d29bc50b3 ("mac80211: always include HE 6GHz capability in probe request")
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since STP TCN frames are only 7 bytes, the pskb_may_pull call returns an error.
Instead of dropping those packets, bump them back to the slow path for proper
processing.
Fixes: 49ddf8e6e2 ("mac80211: add fast-rx path")
Reported-by: Chad Monroe <chad.monroe@smartrg.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This parse_opts will set invalid opts.rfd/wfd in case of failure which
we already check, but it is not clear for readers that parse_opts error
are handled in p9_fd_create: clarify this by explicitely checking the
return value.
Link: https://lkml.kernel.org/r/20220921210921.1654735-1-floridsleeves@gmail.com
Signed-off-by: Li Zhong <floridsleeves@gmail.com>
[Dominique: reworded commit message to clarify this is NOOP]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Shamelessly copying the explanation from Tetsuo Handa's suggested
patch[1] (slightly reworded):
syzbot is reporting inconsistent lock state in p9_req_put()[2],
for p9_tag_remove() from p9_req_put() from IRQ context is using
spin_lock_irqsave() on "struct p9_client"->lock but trans_fd
(not from IRQ context) is using spin_lock().
Since the locks actually protect different things in client.c and in
trans_fd.c, just replace trans_fd.c's lock by a new one specific to the
transport (client.c's protect the idr for fid/tag allocations,
while trans_fd.c's protects its own req list and request status field
that acts as the transport's state machine)
Link: https://lore.kernel.org/r/20220904112928.1308799-1-asmadeus@codewreck.org
Link: https://lkml.kernel.org/r/2470e028-9b05-2013-7198-1fdad071d999@I-love.SAKURA.ne.jp [1]
Link: https://syzkaller.appspot.com/bug?extid=2f20b523930c32c160cc [2]
Reported-by: syzbot <syzbot+2f20b523930c32c160cc@syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Gwangun Jung reported a slab-out-of-bounds access in fib_nh_match:
fib_nh_match+0xf98/0x1130 linux-6.0-rc7/net/ipv4/fib_semantics.c:961
fib_table_delete+0x5f3/0xa40 linux-6.0-rc7/net/ipv4/fib_trie.c:1753
inet_rtm_delroute+0x2b3/0x380 linux-6.0-rc7/net/ipv4/fib_frontend.c:874
Separate nexthop objects are mutually exclusive with the legacy
multipath spec. Fix fib_nh_match to return if the config for the
to be deleted route contains a multipath spec while the fib_info
is using a nexthop object.
Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Fixes: 6bf92d70e6 ("net: ipv4: fix route with nexthop object delete warning")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return error code -EINVAL from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 94160108a7 ("net/ieee802154: fix uninit value bug in dgram_sendmsg")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Link: https://lore.kernel.org/r/20220919160830.1436109-1-weiyongjun@huaweicloud.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
syzbot is reporting hung task at p9_fd_close() [1], for p9_mux_poll_stop()
from p9_conn_destroy() from p9_fd_close() is failing to interrupt already
started kernel_read() from p9_fd_read() from p9_read_work() and/or
kernel_write() from p9_fd_write() from p9_write_work() requests.
Since p9_socket_open() sets O_NONBLOCK flag, p9_mux_poll_stop() does not
need to interrupt kernel_read()/kernel_write(). However, since p9_fd_open()
does not set O_NONBLOCK flag, but pipe blocks unless signal is pending,
p9_mux_poll_stop() needs to interrupt kernel_read()/kernel_write() when
the file descriptor refers to a pipe. In other words, pipe file descriptor
needs to be handled as if socket file descriptor.
We somehow need to interrupt kernel_read()/kernel_write() on pipes.
A minimal change, which this patch is doing, is to set O_NONBLOCK flag
from p9_fd_open(), for O_NONBLOCK flag does not affect reading/writing
of regular files. But this approach changes O_NONBLOCK flag on userspace-
supplied file descriptors (which might break userspace programs), and
O_NONBLOCK flag could be changed by userspace. It would be possible to set
O_NONBLOCK flag every time p9_fd_read()/p9_fd_write() is invoked, but still
remains small race window for clearing O_NONBLOCK flag.
If we don't want to manipulate O_NONBLOCK flag, we might be able to
surround kernel_read()/kernel_write() with set_thread_flag(TIF_SIGPENDING)
and recalc_sigpending(). Since p9_read_work()/p9_write_work() works are
processed by kernel threads which process global system_wq workqueue,
signals could not be delivered from remote threads when p9_mux_poll_stop()
from p9_conn_destroy() from p9_fd_close() is called. Therefore, calling
set_thread_flag(TIF_SIGPENDING)/recalc_sigpending() every time would be
needed if we count on signals for making kernel_read()/kernel_write()
non-blocking.
Link: https://lkml.kernel.org/r/345de429-a88b-7097-d177-adecf9fed342@I-love.SAKURA.ne.jp
Link: https://syzkaller.appspot.com/bug?extid=8b41a1365f1106fd0f33 [1]
Reported-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: add comment at Christian's suggestion]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxjQAAKCRBZ7Krx/gZQ
683pAP9oSHaXo3Twl6rweirNbHocgm8MynCgIU3bpzeVPi6Z1wEApfEq4IInWQyL
R6ObOneoSobi+9Iaqsoe+uKu54MghAY=
=rt7w
-----END PGP SIGNATURE-----
Merge tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_path updates from Al Viro.
* tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
d_path.c: typo fix...
dynamic_dname(): drop unused dentry argument
Allow the caller to force a disconnection of the RPC client so that we
can clear any pending requests that are buffered in the socket.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add the helper rpc_cancel_tasks(), which uses a caller-defined selection
function to define a set of in-flight RPC calls to cancel. This is
mainly intended for pNFS drivers which are subject to a layout recall,
and which may therefore want to cancel all pending I/O using that layout
in order to redrive it after the layout recall has been satisfied.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we immediately call rpc_exit_task() after waking up, and
that the tk_rpc_status cannot get clobbered by some other function.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2022-10-05
Only two patches this time around. A revert from Alexander Aring to a patch
that hit net and the updated patch to fix the problem from Tetsuo Handa.
* tag 'ieee802154-for-net-2022-10-05' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
net/ieee802154: don't warn zero-sized raw_sendmsg()
Revert "net/ieee802154: reject zero-sized raw_sendmsg()"
====================
Link: https://lore.kernel.org/r/20221005144508.787376-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
taprio_attach() has this logic at the end, which should have been
removed with the blamed patch (which is now being reverted):
/* access to the child qdiscs is not needed in offload mode */
if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
kfree(q->qdiscs);
q->qdiscs = NULL;
}
because otherwise, we make use of q->qdiscs[] even after this array was
deallocated, namely in taprio_leaf(). Therefore, whenever one would try
to attach a valid child qdisc to a fully offloaded taprio root, one
would immediately dereference a NULL pointer.
$ tc qdisc replace dev eno0 handle 8001: parent root taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
max-sdu 0 0 0 0 0 200 0 0 \
base-time 200 \
sched-entry S 80 20000 \
sched-entry S a0 20000 \
sched-entry S 5f 60000 \
flags 2
$ max_frame_size=1500
$ data_rate_kbps=20000
$ port_transmit_rate_kbps=1000000
$ idleslope=$data_rate_kbps
$ sendslope=$(($idleslope - $port_transmit_rate_kbps))
$ locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
$ hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
$ tc qdisc replace dev eno0 parent 8001:7 cbs \
idleslope $idleslope \
sendslope $sendslope \
hicredit $hicredit \
locredit $locredit \
offload 0
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030
pc : taprio_leaf+0x28/0x40
lr : qdisc_leaf+0x3c/0x60
Call trace:
taprio_leaf+0x28/0x40
tc_modify_qdisc+0xf0/0x72c
rtnetlink_rcv_msg+0x12c/0x390
netlink_rcv_skb+0x5c/0x130
rtnetlink_rcv+0x1c/0x2c
The solution is not as obvious as the problem. The code which deallocates
q->qdiscs[] is in fact copied and pasted from mqprio, which also
deallocates the array in mqprio_attach() and never uses it afterwards.
Therefore, the identical cleanup logic of priv->qdiscs[] that
mqprio_destroy() has is deceptive because it will never take place at
qdisc_destroy() time, but just at raw ops->destroy() time (otherwise
said, priv->qdiscs[] do not last for the entire lifetime of the mqprio
root), but rather, this is just the twisted way in which the Qdisc API
understands error path cleanup should be done (Qdisc_ops :: destroy() is
called even when Qdisc_ops :: init() never succeeded).
Side note, in fact this is also what the comment in mqprio_init() says:
/* pre-allocate qdisc, attachment can't fail */
Or reworded, mqprio's priv->qdiscs[] scheme is only meant to serve as
data passing between Qdisc_ops :: init() and Qdisc_ops :: attach().
[ this comment was also copied and pasted into the initial taprio
commit, even though taprio_attach() came way later ]
The problem is that taprio also makes extensive use of the q->qdiscs[]
array in the software fast path (taprio_enqueue() and taprio_dequeue()),
but it does not keep a reference of its own on q->qdiscs[i] (you'd think
that since it creates these Qdiscs, it holds the reference, but nope,
this is not completely true).
To understand the difference between taprio_destroy() and mqprio_destroy()
one must look before commit 13511704f8 ("net: taprio offload: enforce
qdisc to netdev queue mapping"), because that just muddied the waters.
In the "original" taprio design, taprio always attached itself (the root
Qdisc) to all netdev TX queues, so that dev_qdisc_enqueue() would go
through taprio_enqueue().
It also called qdisc_refcount_inc() on itself for as many times as there
were netdev TX queues, in order to counter-balance what tc_get_qdisc()
does when destroying a Qdisc (simplified for brevity below):
if (n->nlmsg_type == RTM_DELQDISC)
err = qdisc_graft(dev, parent=NULL, new=NULL, q, extack);
qdisc_graft(where "new" is NULL so this deletes the Qdisc):
for (i = 0; i < num_q; i++) {
struct netdev_queue *dev_queue;
dev_queue = netdev_get_tx_queue(dev, i);
old = dev_graft_qdisc(dev_queue, new);
if (new && i > 0)
qdisc_refcount_inc(new);
qdisc_put(old);
~~~~~~~~~~~~~~
this decrements taprio's refcount once for each TX queue
}
notify_and_destroy(net, skb, n, classid,
rtnl_dereference(dev->qdisc), new);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and this finally decrements it to zero,
making qdisc_put() call qdisc_destroy()
The q->qdiscs[] created using qdisc_create_dflt() (or their
replacements, if taprio_graft() was ever to get called) were then
privately freed by taprio_destroy().
This is still what is happening after commit 13511704f8 ("net: taprio
offload: enforce qdisc to netdev queue mapping"), but only for software
mode.
In full offload mode, the per-txq "qdisc_put(old)" calls from
qdisc_graft() now deallocate the child Qdiscs rather than decrement
taprio's refcount. So when notify_and_destroy(taprio) finally calls
taprio_destroy(), the difference is that the child Qdiscs were already
deallocated.
And this is exactly why the taprio_attach() comment "access to the child
qdiscs is not needed in offload mode" is deceptive too. Not only the
q->qdiscs[] array is not needed, but it is also necessary to get rid of
it as soon as possible, because otherwise, we will also call qdisc_put()
on the child Qdiscs in qdisc_destroy() -> taprio_destroy(), and this
will cause a nasty use-after-free/refcount-saturate/whatever.
In short, the problem is that since the blamed commit, taprio_leaf()
needs q->qdiscs[] to not be freed by taprio_attach(), while qdisc_destroy()
-> taprio_destroy() does need q->qdiscs[] to be freed by taprio_attach()
for full offload. Fixing one problem triggers the other.
All of this can be solved by making taprio keep its q->qdiscs[i] with a
refcount elevated at 2 (in offloaded mode where they are attached to the
netdev TX queues), both in taprio_attach() and in taprio_graft(). The
generic qdisc_graft() would just decrement the child qdiscs' refcounts
to 1, and taprio_destroy() would give them the final coup de grace.
However the rabbit hole of changes is getting quite deep, and the
complexity increases. The blamed commit was supposed to be a bug fix in
the first place, and the bug it addressed is not so significant so as to
justify further rework in stable trees. So I'd rather just revert it.
I don't know enough about multi-queue Qdisc design to make a proper
judgement right now regarding what is/isn't idiomatic use of Qdisc
concepts in taprio. I will try to study the problem more and come with a
different solution in net-next.
Fixes: 1461d212ab ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
Reported-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20221004220100.1650558-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/sunrpc/xprtrdma/frwr_ops.c:151:32: warning: variable 'rc' is uninitialized when used here [-Wuninitialized]
trace_xprtrdma_frwr_alloc(mr, rc);
^~
net/sunrpc/xprtrdma/frwr_ops.c:127:8: note: initialize the variable 'rc' to silence this warning
int rc;
^
= 0
1 warning generated.
The tracepoint is intended to record the error returned from
ib_alloc_mr(). In the current code there is no other purpose for
@rc, so simply replace it.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: d8cf39a280c3b0 ('xprtrdma: MR-related memory allocation should be allowed to fail')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Many memory allocations that xprtrdma does can fail safely. Let's
use this fact to avoid some potential deadlocks: Replace GFP_KERNEL
with GFP flags that do not try hard to acquire memory.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An attempt to establish a connection can always fail and then be
retried. GFP_KERNEL allocation is not necessary here.
Like MR allocation, establishing a connection is always done in a
worker thread. The new GFP flags align with the flags that would
be returned by rpc_task_gfp_mask() in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprtrdma always drives a retry of MR allocation if it should fail.
It should be safe to not use GFP_KERNEL for this purpose rather
than sleeping in the memory allocator.
In theory, if these weaker allocations are attempted first, memory
exhaustion is likely to cause xprtrdma to fail fast and not then
invoke the RDMA core APIs, which still might use GFP_KERNEL.
Also note that rpc_task_gfp_mask() always sets __GFP_NORETRY and
__GFP_NOWARN when an RPC-related allocation is being done in a
worker thread. MR allocation is already always done in worker
threads.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently all rpcrdma_regbuf_alloc() call sites pass the same value
as their third argument. That argument can therefore be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 1769e6a816 ("xprtrdma: Clean up rpcrdma_create_req()")
added rpcrdma_req_create() with a GFP flags argument in case a
caller might want to avoid waiting for memory.
There has never been a caller that does not pass GFP_KERNEL as
the third argument. That argument can therefore be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprt_rdma_bc_allocate() is now the only user of RPCRDMA_DEF_GFP.
Replace that macro with the raw flags.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:
kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS: 0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: <TASK>
kernel: __flush_work.isra.0+0xaf/0x188
kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
kernel: ? lock_timer_base+0x38/0x5f
kernel: __cancel_work_timer+0xea/0x13d
kernel: ? preempt_latency_start+0x2b/0x46
kernel: rdma_addr_cancel+0x70/0x81 [ib_core]
kernel: _destroy_id+0x1a/0x246 [rdma_cm]
kernel: rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel: ? _raw_spin_unlock+0x14/0x29
kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
kernel: ? finish_task_switch.isra.0+0x171/0x249
kernel: xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel: process_one_work+0x1d8/0x2d4
kernel: worker_thread+0x18b/0x24f
kernel: ? rescuer_thread+0x280/0x280
kernel: kthread+0xf4/0xfc
kernel: ? kthread_complete_and_exit+0x1b/0x1b
kernel: ret_from_fork+0x22/0x30
kernel: </TASK>
SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.
Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.
So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
So far 'msize' was simply used for all 9p message types, which is far
too much and slowed down performance tremendously with large values
for user configurable 'msize' option.
Let's stop this waste by using the new p9_msg_buf_size() function for
allocating more appropriate, smaller buffers according to what is
actually sent over the wire.
Only exception: RDMA transport is currently excluded from this message
size optimization - for its response buffers that is - as RDMA transport
would not cope with it, due to its response buffers being pulled from a
shared pool. [1]
Link: https://lore.kernel.org/all/Ys3jjg52EIyITPua@codewreck.org/ [1]
Link: https://lkml.kernel.org/r/3f51590535dc96ed0a165b8218c57639cfa5c36c.1657920926.git.linux_oss@crudebyte.com
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
applies to almost all 9p message types actually) simply use hard coded
values. For some variable-length and potentially large message types
calculate a more precise value according to what data is actually
transmitted to avoid unnecessarily huge buffers.
So p9_msg_buf_size() divides the individual 9p message types into 3
message size categories:
- dynamically calculated message size (i.e. potentially large)
- 8k hard coded message size
- 4k hard coded message size
As for the latter two hard coded message types: for most 9p message
types it is pretty obvious whether they would always fit into 4k or
8k. But for some of them it depends on the maximum directory entry
name length allowed by OS and filesystem for determining into which
of the two size categories they would fit into. Currently Linux
supports directory entry names up to NAME_MAX (255), however when
comparing the limitation of individual filesystems, ReiserFS
theoretically supports up to slightly below 4k long names. So in
order to make this code more future proof, and as revisiting it
later on is a bit tedious and has the potential to miss out details,
the decision [1] was made to take 4k as basis as for max. name length.
Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com
Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1]
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Refactor 'max_size' argument of p9_tag_alloc() and 'req_size' argument
of p9_client_prepare_req() both into a pair of arguments 't_size' and
'r_size' respectively to allow handling the buffer size for request and
reply separately from each other.
Link: https://lkml.kernel.org/r/9431a25fe4b37fd12cecbd715c13af71f701f220.1657920926.git.linux_oss@crudebyte.com
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Core
----
- Introduce and use a single page frag cache for allocating small skb
heads, clawing back the 10-20% performance regression in UDP flood
test from previous fixes.
- Run packets which already went thru HW coalescing thru SW GRO.
This significantly improves TCP segment coalescing and simplifies
deployments as different workloads benefit from HW or SW GRO.
- Shrink the size of the base zero-copy send structure.
- Move TCP init under a new slow / sleepable version of DO_ONCE().
BPF
---
- Add BPF-specific, any-context-safe memory allocator.
- Add helpers/kfuncs for PKCS#7 signature verification from BPF
programs.
- Define a new map type and related helpers for user space -> kernel
communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
- Allow targeting BPF iterators to loop through resources of one
task/thread.
- Add ability to call selected destructive functions.
Expose crash_kexec() to allow BPF to trigger a kernel dump.
Use CAP_SYS_BOOT check on the loading process to judge permissions.
- Enable BPF to collect custom hierarchical cgroup stats efficiently
by integrating with the rstat framework.
- Support struct arguments for trampoline based programs.
Only structs with size <= 16B and x86 are supported.
- Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
sockets (instead of just TCP and UDP sockets).
- Add a helper for accessing CLOCK_TAI for time sensitive network
related programs.
- Support accessing network tunnel metadata's flags.
- Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
- Add support for writing to Netfilter's nf_conn:mark.
Protocols
---------
- WiFi: more Extremely High Throughput (EHT) and Multi-Link
Operation (MLO) work (802.11be, WiFi 7).
- vsock: improve support for SO_RCVLOWAT.
- SMC: support SO_REUSEPORT.
- Netlink: define and document how to use netlink in a "modern" way.
Support reporting missing attributes via extended ACK.
- IPSec: support collect metadata mode for xfrm interfaces.
- TCPv6: send consistent autoflowlabel in SYN_RECV state
and RST packets.
- TCP: introduce optional per-netns connection hash table to allow
better isolation between namespaces (opt-in, at the cost of memory
and cache pressure).
- MPTCP: support TCP_FASTOPEN_CONNECT.
- Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
- Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
- Open vSwitch:
- Allow specifying ifindex of new interfaces.
- Allow conntrack and metering in non-initial user namespace.
- TLS: support the Korean ARIA-GCM crypto algorithm.
- Remove DECnet support.
Driver API
----------
- Allow selecting the conduit interface used by each port
in DSA switches, at runtime.
- Ethernet Power Sourcing Equipment and Power Device support.
- Add tc-taprio support for queueMaxSDU parameter, i.e. setting
per traffic class max frame size for time-based packet schedules.
- Support PHY rate matching - adapting between differing host-side
and link-side speeds.
- Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
- Validate OF (device tree) nodes for DSA shared ports; make
phylink-related properties mandatory on DSA and CPU ports.
Enforcing more uniformity should allow transitioning to phylink.
- Require that flash component name used during update matches one
of the components for which version is reported by info_get().
- Remove "weight" argument from driver-facing NAPI API as much
as possible. It's one of those magic knobs which seemed like
a good idea at the time but is too indirect to use in practice.
- Support offload of TLS connections with 256 bit keys.
New hardware / drivers
----------------------
- Ethernet:
- Microchip KSZ9896 6-port Gigabit Ethernet Switch
- Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
- Analog Devices ADIN1110 and ADIN2111 industrial single pair
Ethernet (10BASE-T1L) MAC+PHY.
- Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
- Ethernet SFPs / modules:
- RollBall / Hilink / Turris 10G copper SFPs
- HALNy GPON module
- WiFi:
- CYW43439 SDIO chipset (brcmfmac)
- CYW89459 PCIe chipset (brcmfmac)
- BCM4378 on Apple platforms (brcmfmac)
Drivers
-------
- CAN:
- gs_usb: HW timestamp support
- Ethernet PHYs:
- lan8814: cable diagnostics
- Ethernet NICs:
- Intel (100G):
- implement control of FCS/CRC stripping
- port splitting via devlink
- L2TPv3 filtering offload
- nVidia/Mellanox:
- tunnel offload for sub-functions
- MACSec offload, w/ Extended packet number and replay
window offload
- significantly restructure, and optimize the AF_XDP support,
align the behavior with other vendors
- Huawei:
- configuring DSCP map for traffic class selection
- querying standard FEC statistics
- querying SerDes lane number via ethtool
- Marvell/Cavium:
- egress priority flow control
- MACSec offload
- AMD/SolarFlare:
- PTP over IPv6 and raw Ethernet
- small / embedded:
- ax88772: convert to phylink (to support SFP cages)
- altera: tse: convert to phylink
- ftgmac100: support fixed link
- enetc: standard Ethtool counters
- macb: ZynqMP SGMII dynamic configuration support
- tsnep: support multi-queue and use page pool
- lan743x: Rx IP & TCP checksum offload
- igc: add xdp frags support to ndo_xdp_xmit
- Ethernet high-speed switches:
- Marvell (prestera):
- support SPAN port features (traffic mirroring)
- nexthop object offloading
- Microchip (sparx5):
- multicast forwarding offload
- QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- support RGMII cmode
- NXP (felix):
- standardized ethtool counters
- Microchip (lan966x):
- QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
- traffic policing and mirroring
- link aggregation / bonding offload
- QUSGMII PHY mode support
- Qualcomm 802.11ax WiFi (ath11k):
- cold boot calibration support on WCN6750
- support to connect to a non-transmit MBSSID AP profile
- enable remain-on-channel support on WCN6750
- Wake-on-WLAN support for WCN6750
- support to provide transmit power from firmware via nl80211
- support to get power save duration for each client
- spectral scan support for 160 MHz
- MediaTek WiFi (mt76):
- WiFi-to-Ethernet bridging offload for MT7986 chips
- RealTek WiFi (rtw89):
- P2P support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmM7vtkACgkQMUZtbf5S
Irvotg//dmh53rC+UMKO3OgOqPlSMnaqzbUdDEfN6mj4Mpox7Csb8zERVURHhBHY
fvlXWsDgxmvgTebI5fvNC5+f1iW5xcqgJV2TWnNmDOKWwvQwb6qQfgixVmunvkpe
IIukMXYt0dAf9bXeeEfbNXcCb85cPwB76stX0tMV6BX7osp3T0TL1fvFk0NJkL0j
TeydLad/yAQtPb4TbeWYjNDoxPVDf0cVpUrevLGmWE88UMYmgTqPze+h1W5Wri52
bzjdLklY/4cgcIZClHQ6F9CeRWqEBxvujA5Hj/cwOcn/ptVVJWUGi7sQo3sYkoSs
HFu+F8XsTec14kGNC0Ab40eVdqs5l/w8+E+4jvgXeKGOtVns8DwoiUIzqXpyty89
Ib04mffrwWNjFtHvo/kIsNwP05X2PGE9HUHfwsTUfisl/ASvMmQp7D7vUoqQC/4B
AMVzT5qpjkmfBHYQQGuw8FxJhMeAOjC6aAo6censhXJyiUhIfleQsN0syHdaNb8q
9RZlhAgQoVb6ZgvBV8r8unQh/WtNZ3AopwifwVJld2unsE/UNfQy2KyqOWBES/zf
LP9sfuX0JnmHn8s1BQEUMPU1jF9ZVZCft7nufJDL6JhlAL+bwZeEN4yCiAHOPZqE
ymSLHI9s8yWZoNpuMWKrI9kFexVnQFKmA3+quAJUcYHNMSsLkL8=
=Gsio
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Introduce and use a single page frag cache for allocating small skb
heads, clawing back the 10-20% performance regression in UDP flood
test from previous fixes.
- Run packets which already went thru HW coalescing thru SW GRO. This
significantly improves TCP segment coalescing and simplifies
deployments as different workloads benefit from HW or SW GRO.
- Shrink the size of the base zero-copy send structure.
- Move TCP init under a new slow / sleepable version of DO_ONCE().
BPF:
- Add BPF-specific, any-context-safe memory allocator.
- Add helpers/kfuncs for PKCS#7 signature verification from BPF
programs.
- Define a new map type and related helpers for user space -> kernel
communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
- Allow targeting BPF iterators to loop through resources of one
task/thread.
- Add ability to call selected destructive functions. Expose
crash_kexec() to allow BPF to trigger a kernel dump. Use
CAP_SYS_BOOT check on the loading process to judge permissions.
- Enable BPF to collect custom hierarchical cgroup stats efficiently
by integrating with the rstat framework.
- Support struct arguments for trampoline based programs. Only
structs with size <= 16B and x86 are supported.
- Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
sockets (instead of just TCP and UDP sockets).
- Add a helper for accessing CLOCK_TAI for time sensitive network
related programs.
- Support accessing network tunnel metadata's flags.
- Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
- Add support for writing to Netfilter's nf_conn:mark.
Protocols:
- WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
(MLO) work (802.11be, WiFi 7).
- vsock: improve support for SO_RCVLOWAT.
- SMC: support SO_REUSEPORT.
- Netlink: define and document how to use netlink in a "modern" way.
Support reporting missing attributes via extended ACK.
- IPSec: support collect metadata mode for xfrm interfaces.
- TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
packets.
- TCP: introduce optional per-netns connection hash table to allow
better isolation between namespaces (opt-in, at the cost of memory
and cache pressure).
- MPTCP: support TCP_FASTOPEN_CONNECT.
- Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
- Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
- Open vSwitch:
- Allow specifying ifindex of new interfaces.
- Allow conntrack and metering in non-initial user namespace.
- TLS: support the Korean ARIA-GCM crypto algorithm.
- Remove DECnet support.
Driver API:
- Allow selecting the conduit interface used by each port in DSA
switches, at runtime.
- Ethernet Power Sourcing Equipment and Power Device support.
- Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
traffic class max frame size for time-based packet schedules.
- Support PHY rate matching - adapting between differing host-side
and link-side speeds.
- Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
- Validate OF (device tree) nodes for DSA shared ports; make
phylink-related properties mandatory on DSA and CPU ports.
Enforcing more uniformity should allow transitioning to phylink.
- Require that flash component name used during update matches one of
the components for which version is reported by info_get().
- Remove "weight" argument from driver-facing NAPI API as much as
possible. It's one of those magic knobs which seemed like a good
idea at the time but is too indirect to use in practice.
- Support offload of TLS connections with 256 bit keys.
New hardware / drivers:
- Ethernet:
- Microchip KSZ9896 6-port Gigabit Ethernet Switch
- Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
- Analog Devices ADIN1110 and ADIN2111 industrial single pair
Ethernet (10BASE-T1L) MAC+PHY.
- Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
- Ethernet SFPs / modules:
- RollBall / Hilink / Turris 10G copper SFPs
- HALNy GPON module
- WiFi:
- CYW43439 SDIO chipset (brcmfmac)
- CYW89459 PCIe chipset (brcmfmac)
- BCM4378 on Apple platforms (brcmfmac)
Drivers:
- CAN:
- gs_usb: HW timestamp support
- Ethernet PHYs:
- lan8814: cable diagnostics
- Ethernet NICs:
- Intel (100G):
- implement control of FCS/CRC stripping
- port splitting via devlink
- L2TPv3 filtering offload
- nVidia/Mellanox:
- tunnel offload for sub-functions
- MACSec offload, w/ Extended packet number and replay window
offload
- significantly restructure, and optimize the AF_XDP support,
align the behavior with other vendors
- Huawei:
- configuring DSCP map for traffic class selection
- querying standard FEC statistics
- querying SerDes lane number via ethtool
- Marvell/Cavium:
- egress priority flow control
- MACSec offload
- AMD/SolarFlare:
- PTP over IPv6 and raw Ethernet
- small / embedded:
- ax88772: convert to phylink (to support SFP cages)
- altera: tse: convert to phylink
- ftgmac100: support fixed link
- enetc: standard Ethtool counters
- macb: ZynqMP SGMII dynamic configuration support
- tsnep: support multi-queue and use page pool
- lan743x: Rx IP & TCP checksum offload
- igc: add xdp frags support to ndo_xdp_xmit
- Ethernet high-speed switches:
- Marvell (prestera):
- support SPAN port features (traffic mirroring)
- nexthop object offloading
- Microchip (sparx5):
- multicast forwarding offload
- QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- support RGMII cmode
- NXP (felix):
- standardized ethtool counters
- Microchip (lan966x):
- QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
- traffic policing and mirroring
- link aggregation / bonding offload
- QUSGMII PHY mode support
- Qualcomm 802.11ax WiFi (ath11k):
- cold boot calibration support on WCN6750
- support to connect to a non-transmit MBSSID AP profile
- enable remain-on-channel support on WCN6750
- Wake-on-WLAN support for WCN6750
- support to provide transmit power from firmware via nl80211
- support to get power save duration for each client
- spectral scan support for 160 MHz
- MediaTek WiFi (mt76):
- WiFi-to-Ethernet bridging offload for MT7986 chips
- RealTek WiFi (rtw89):
- P2P support"
* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
eth: pse: add missing static inlines
once: rename _SLOW to _SLEEPABLE
net: pse-pd: add regulator based PSE driver
dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
ethtool: add interface to interact with Ethernet Power Equipment
net: mdiobus: search for PSE nodes by parsing PHY nodes.
net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
net: add framework to support Ethernet PSE and PDs devices
dt-bindings: net: phy: add PoDL PSE property
net: marvell: prestera: Propagate nh state from hw to kernel
net: marvell: prestera: Add neighbour cache accounting
net: marvell: prestera: add stub handler neighbour events
net: marvell: prestera: Add heplers to interact with fib_notifier_info
net: marvell: prestera: Add length macros for prestera_ip_addr
net: marvell: prestera: add delayed wq and flush wq on deinit
net: marvell: prestera: Add strict cleanup of fib arbiter
net: marvell: prestera: Add cleanup of allocated fib_nodes
net: marvell: prestera: Add router nexthops ABI
eth: octeon: fix build after netif_napi_add() changes
net/mlx5: E-Switch, Return EBUSY if can't get mode lock
...
ceph_msg_data_next is always passed a NULL pointer for this field. Some
of the "next" operations look at it in order to determine the length,
but we can just take the min of the data on the page or cursor->resid.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This release is mostly bug fixes, clean-ups, and optimizations.
One notable set of fixes addresses a subtle buffer overflow issue
that occurs if a small RPC Call message arrives in an oversized
RPC record. This is only possible on a framed RPC transport such
as TCP.
Because NFSD shares the receive and send buffers in one set of
pages, an oversized RPC record steals pages from the send buffer
that will be used to construct the RPC Reply message. NFSD must
not assume that a full-sized buffer is always available to it;
otherwise, it will walk off the end of the send buffer while
constructing its reply.
In this release, we also introduce the ability for the server to
wait a moment for clients to return delegations before it responds
with NFS4ERR_DELAY. This saves a retransmit and a network round-
trip when a delegation recall is needed. This work will be built
upon in future releases.
The NFS server adds another shrinker to its collection. Because
courtesy clients can linger for quite some time, they might be
freeable when the server host comes under memory pressure. A new
shrinker has been added that releases courtesy client resources
during low memory scenarios.
Lastly, of note: the maximum number of operations per NFSv4
COMPOUND that NFSD can handle is increased from 16 to 50. There
are NFSv4 client implementations that need more than 16 to
successfully perform a mount operation that uses a pathname
with many components.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmM66P4ACgkQM2qzM29m
f5fhMg//afS2mp4fgPz4MjoFIqD/Icep8qFEPA8Gy6I1dDGRxd9wNgjoN4JALFdr
NKX1oRVISBvDrOG/C84GbYnXEDlzY8q1HmyPoJA8VAR57hnJXfPZN6CBN//Bx4mU
nISPJeNGY9SMNVhS8916V/yzd41uWDQuD+H+i5mluBTJHONgSzwzc80sQ+eq+yZQ
PV6mlJN6hcm14LCaDOTXF7oY2Wm6dQc2rV87YChJWnc+vdXKnme/LWTMY1ABkePD
g88mSL6w3YDKEuKciWda5/QU1ETp/Q7XTjFGDKEQSnnNsvCLmUKogJTKVa2QqyLY
P1qlrj6XwukqAe414W4amlLL3q4NUFmJZPNWDxdf+qtTrQrBBEFrsKy/bSt27XoD
cTvBWcorMG2riSlYPViVeh8RpyC6qwhttPbvGAmflVF2KEyXpfgc5Pnn0/Xm1Ac9
XKzaCTJlUyRb/2wdqVtQbIpyh3sbhzp8zhv7sWKXgQOEXxKOO3ZAIrQXeL6oFN/b
HlXDty7wKhFRj8IbkZfQ9SvN1saTONwB3clYHbCXTetkw/nnrUgLYcu8NIDBK9ou
wkBcz1++XgVTqjRFwUwagb62cPJnRM6UiROYCVbQp7qcUe4/U+WP9t6dnZlnGZVZ
dtipKlH/LTGKW+d7ysZOqb4hsRza5Kaduz5a7lML7UIGQXmxjM0=
=hE6t
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release is mostly bug fixes, clean-ups, and optimizations.
One notable set of fixes addresses a subtle buffer overflow issue that
occurs if a small RPC Call message arrives in an oversized RPC record.
This is only possible on a framed RPC transport such as TCP.
Because NFSD shares the receive and send buffers in one set of pages,
an oversized RPC record steals pages from the send buffer that will be
used to construct the RPC Reply message. NFSD must not assume that a
full-sized buffer is always available to it; otherwise, it will walk
off the end of the send buffer while constructing its reply.
In this release, we also introduce the ability for the server to wait
a moment for clients to return delegations before it responds with
NFS4ERR_DELAY. This saves a retransmit and a network round- trip when
a delegation recall is needed. This work will be built upon in future
releases.
The NFS server adds another shrinker to its collection. Because
courtesy clients can linger for quite some time, they might be
freeable when the server host comes under memory pressure. A new
shrinker has been added that releases courtesy client resources during
low memory scenarios.
Lastly, of note: the maximum number of operations per NFSv4 COMPOUND
that NFSD can handle is increased from 16 to 50. There are NFSv4
client implementations that need more than 16 to successfully perform
a mount operation that uses a pathname with many components"
* tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (53 commits)
nfsd: extra checks when freeing delegation stateids
nfsd: make nfsd4_run_cb a bool return function
nfsd: fix comments about spinlock handling with delegations
nfsd: only fill out return pointer on success in nfsd4_lookup_stateid
NFSD: fix use-after-free on source server when doing inter-server copy
NFSD: Cap rsize_bop result based on send buffer size
NFSD: Rename the fields in copy_stateid_t
nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_file_cache_stats_fops
nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_reply_cache_stats_fops
nfsd: use DEFINE_SHOW_ATTRIBUTE to define client_info_fops
nfsd: use DEFINE_SHOW_ATTRIBUTE to define export_features_fops and supported_enctypes_fops
nfsd: use DEFINE_PROC_SHOW_ATTRIBUTE to define nfsd_proc_ops
NFSD: Pack struct nfsd4_compoundres
NFSD: Remove unused nfsd4_compoundargs::cachetype field
NFSD: Remove "inline" directives on op_rsize_bop helpers
NFSD: Clean up nfs4svc_encode_compoundres()
SUNRPC: Fix typo in xdr_buf_subsegment's kdoc comment
NFSD: Clean up WRITE arg decoders
NFSD: Use xdr_inline_decode() to decode NFSv3 symlinks
NFSD: Refactor common code out of dirlist helpers
...
The _SLOW designation wasn't really descriptive of anything. This is
meant to be called from process context when it's possible to sleep. So
name this more aptly _SLEEPABLE, which better fits its intended use.
Fixes: 62c07983be ("once: add DO_ONCE_SLOW() for sleepable contexts")
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221003181413.1221968-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add interface to support Power Sourcing Equipment. At current step it
provides generic way to address all variants of PSE devices as defined
in IEEE 802.3-2018 but support only objects specified for IEEE 802.3-2018 104.4
PoDL Power Sourcing Equipment (PSE).
Currently supported and mandatory objects are:
IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus
IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
IEEE 802.3-2018 30.15.1.2.1 acPoDLPSEAdminControl
This is minimal interface needed to control PSE on each separate
ethernet port but it provides not all mandatory objects specified in
IEEE 802.3-2018.
Since "PoDL PSE" and "PSE" have similar names, but some different values
I decide to not merge them and keep separate naming schema. This should
allow as to be as close to IEEE 802.3 spec as possible and avoid name
conflicts in the future.
This implementation is connected to PHYs instead of MACs because PSE
auto classification can potentially interfere with PHY auto negotiation.
So, may be some extra PHY related initialization will be needed.
With WIP version of ethtools interaction with PSE capable link looks
as following:
$ ip l
...
5: t1l1@eth0: <BROADCAST,MULTICAST> ..
...
$ ethtool --show-pse t1l1
PSE attributs for t1l1:
PoDL PSE Admin State: disabled
PoDL PSE Power Detection Status: disabled
$ ethtool --set-pse t1l1 podl-pse-admin-control enable
$ ethtool --show-pse t1l1
PSE attributs for t1l1:
PoDL PSE Admin State: enabled
PoDL PSE Power Detection Status: delivering power
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2022-10-03
We've added 10 non-merge commits during the last 23 day(s) which contain
a total of 14 files changed, 130 insertions(+), 69 deletions(-).
The main changes are:
1) Fix dynptr helper API to gate behind CAP_BPF given it was not intended
for unprivileged BPF programs, from Kumar Kartikeya Dwivedi.
2) Fix need_wakeup flag inheritance from umem buffer pool for shared xsk
sockets, from Jalal Mostafa.
3) Fix truncated last_member_type_id in btf_struct_resolve() which had a
wrong storage type, from Lorenz Bauer.
4) Fix xsk back-pressure mechanism on tx when amount of produced
descriptors to CQ is lower than what was grabbed from xsk tx ring,
from Maciej Fijalkowski.
5) Fix wrong cgroup attach flags being displayed to effective progs,
from Pu Lehui.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xsk: Inherit need_wakeup flag for shared sockets
bpf: Gate dynptr API behind CAP_BPF
selftests/bpf: Adapt cgroup effective query uapi change
bpftool: Fix wrong cgroup attach flags being assigned to effective progs
bpf, cgroup: Reject prog_attach_flags array when effective query
bpf: Ensure correct locking around vulnerable function find_vpid()
bpf: btf: fix truncated last_member_type_id in btf_struct_resolve
selftests/xsk: Add missing close() on netns fd
xsk: Fix backpressure mechanism on Tx
MAINTAINERS: Add include/linux/tnum.h to BPF CORE
====================
Link: https://lore.kernel.org/r/20221003201957.13149-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-10-03
We've added 143 non-merge commits during the last 27 day(s) which contain
a total of 151 files changed, 8321 insertions(+), 1402 deletions(-).
The main changes are:
1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu.
2) Add support for struct-based arguments for trampoline based BPF programs,
from Yonghong Song.
3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa.
4) Batch of improvements to veristat selftest tool in particular to add CSV output,
a comparison mode for CSV outputs and filtering, from Andrii Nakryiko.
5) Add preparatory changes needed for the BPF core for upcoming BPF HID support,
from Benjamin Tissoires.
6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program
types, from Daniel Xu.
7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler.
8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer /
single-kernel-consumer semantics for BPF ring buffer, from David Vernet.
9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF
hashtab's bucket lock, from Hou Tao.
10) Allow creating an iterator that loops through only the resources of one
task/thread instead of all, from Kui-Feng Lee.
11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi.
12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT
entry which is not yet inserted, from Lorenzo Bianconi.
13) Remove invalid recursion check for struct_ops for TCP congestion control BPF
programs, from Martin KaFai Lau.
14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu.
15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa.
16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen.
17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu.
18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta.
19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits)
net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c
Documentation: bpf: Add implementation notes documentations to table of contents
bpf, docs: Delete misformatted table.
selftests/xsk: Fix double free
bpftool: Fix error message of strerror
libbpf: Fix overrun in netlink attribute iteration
selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged"
samples/bpf: Fix typo in xdp_router_ipv4 sample
bpftool: Remove unused struct event_ring_info
bpftool: Remove unused struct btf_attach_point
bpf, docs: Add TOC and fix formatting.
bpf, docs: Add Clang note about BPF_ALU
bpf, docs: Move Clang notes to a separate file
bpf, docs: Linux byteswap note
bpf, docs: Move legacy packet instructions to a separate file
selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
...
====================
Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use max_t() to simplify open code which uses "if...else" to get maximum of
two values.
Generated by coccinelle script:
scripts/coccinelle/misc/minmax.cocci
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use ida_alloc()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Christophe Leroy reported a ~80ms latency spike
happening at first TCP connect() time.
This is because __inet_hash_connect() uses get_random_once()
to populate a perturbation table which became quite big
after commit 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
get_random_once() uses DO_ONCE(), which block hard irqs for the duration
of the operation.
This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock
for operations where we prefer to stay in process context.
Then __inet_hash_connect() can use get_random_slow_once()
to populate its perturbation table.
Fixes: 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
Fixes: 190cc82489 ("tcp: change source port randomizarion at connect() time")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot is hitting skb_assert_len() warning at raw_sendmsg() for ieee802154
socket. What commit dc633700f0 ("net/af_packet: check len when
min_header_len equals to 0") does also applies to ieee802154 socket.
Link: https://syzkaller.appspot.com/bug?extid=5ea725c25d06fb9114c4
Reported-by: syzbot <syzbot+5ea725c25d06fb9114c4@syzkaller.appspotmail.com>
Fixes: fd18942244 ("bpf: Don't redirect packets with invalid pkt_len")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current GRO stack only supports incoming packets containing
one frame/MSS.
This patch changes GRO to accept packets that are already GRO.
HW-GRO (aka RSC for some vendors) is very often limited in presence
of interleaved packets. Linux SW GRO stack can complete the job
and provide larger GRO packets, thus reducing rate of ACK packets
and cpu overhead.
This also means BIG TCP can still be used, even if HW-GRO/RSC was
able to cook ~64 KB GRO packets.
v2: fix logic in tcp_gro_receive()
Only support TCP for the moment (Paolo)
Co-Developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP data path is quite complex and hard to understend even
without some foggy comments referring to modified code and/or
completely misleading from the beginning.
Update a few of them to more accurately describing the current
status.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daire reported a user-space application hang-up when the
peer is forcibly closed before the data transfer completion.
The relevant application expects the peer to either
do an application-level clean shutdown or a transport-level
connection reset.
We can accommodate a such user by extending the fastclose
usage: at fd close time, if the msk socket has some unread
data, and at FIN_WAIT timeout.
Note that at MPTCP close time we must ensure that the TCP
subflows will reset: set the linger socket option to a suitable
value.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an mptcp socket is closed due to an incoming FASTCLOSE
option, so specific sk_err is set and later syscall will
fail usually with EPIPE.
Align the current fastclose error handling with TCP reset,
properly setting the socket error according to the current
msk state and propagating such error.
Additionally sendmsg() is currently not handling properly
the sk_err, always returning EPIPE.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a sequence of memory leaks, and one of them indicated we
failed to free a whole sk:
unreferenced object 0xffff8880126e0000 (size 1088):
comm "syz-executor419", pid 326, jiffies 4294773607 (age 12.609s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 7d 00 00 00 00 00 00 00 ........}.......
01 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<000000006fefe750>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:1970
[<0000000074006db5>] sk_alloc+0x3b/0x800 net/core/sock.c:2029
[<00000000728cd434>] unix_create1+0xaf/0x920 net/unix/af_unix.c:928
[<00000000a279a139>] unix_create+0x113/0x1d0 net/unix/af_unix.c:997
[<0000000068259812>] __sock_create+0x2ab/0x550 net/socket.c:1516
[<00000000da1521e1>] sock_create net/socket.c:1566 [inline]
[<00000000da1521e1>] __sys_socketpair+0x1a8/0x550 net/socket.c:1698
[<000000007ab259e1>] __do_sys_socketpair net/socket.c:1751 [inline]
[<000000007ab259e1>] __se_sys_socketpair net/socket.c:1748 [inline]
[<000000007ab259e1>] __x64_sys_socketpair+0x97/0x100 net/socket.c:1748
[<000000007dedddc1>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<000000007dedddc1>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<000000009456679f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
We can reproduce this issue by creating two AF_UNIX SOCK_STREAM sockets,
send()ing an OOB skb to each other, and close()ing them without consuming
the OOB skbs.
int skpair[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, skpair);
send(skpair[0], "x", 1, MSG_OOB);
send(skpair[1], "x", 1, MSG_OOB);
close(skpair[0]);
close(skpair[1]);
Currently, we free an OOB skb in unix_sock_destructor() which is called via
__sk_free(), but it's too late because the receiver's unix_sk(sk)->oob_skb
is accounted against the sender's sk->sk_wmem_alloc and __sk_free() is
called only when sk->sk_wmem_alloc is 0.
In the repro sequences, we do not consume the OOB skb, so both two sk's
sock_put() never reach __sk_free() due to the positive sk->sk_wmem_alloc.
Then, no one can consume the OOB skb nor call __sk_free(), and we finally
leak the two whole sk.
Thus, we must free the unconsumed OOB skb earlier when close()ing the
socket.
Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>