Our CI reported lockdep splat in the fastopen code:
======================================================
WARNING: possible circular locking dependency detected
6.0.0.mptcp_f5e8bfe9878d+ #1558 Not tainted
------------------------------------------------------
packetdrill/1071 is trying to acquire lock:
ffff8881bd198140 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_wait_for_connect+0x19c/0x310
but task is already holding lock:
ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (k-sk_lock-AF_INET){+.+.}-{0:0}:
__lock_acquire+0xb6d/0x1860
lock_acquire+0x1d8/0x620
lock_sock_nested+0x37/0xd0
inet_stream_connect+0x3f/0xa0
mptcp_connect+0x411/0x800
__inet_stream_connect+0x3ab/0x800
mptcp_stream_connect+0xac/0x110
__sys_connect+0x101/0x130
__x64_sys_connect+0x6e/0xb0
do_syscall_64+0x59/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
check_prev_add+0x15e/0x2110
validate_chain+0xace/0xdf0
__lock_acquire+0xb6d/0x1860
lock_acquire+0x1d8/0x620
lock_sock_nested+0x37/0xd0
inet_wait_for_connect+0x19c/0x310
__inet_stream_connect+0x26c/0x800
tcp_sendmsg_fastopen+0x341/0x650
mptcp_sendmsg+0x109d/0x1740
sock_sendmsg+0xe1/0x120
__sys_sendto+0x1c7/0x2a0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x59/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(k-sk_lock-AF_INET);
lock(sk_lock-AF_INET);
lock(k-sk_lock-AF_INET);
lock(sk_lock-AF_INET);
*** DEADLOCK ***
1 lock held by packetdrill/1071:
#0: ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
======================================================
The problem is caused by the blocking inet_wait_for_connect() releasing
and re-acquiring the msk socket lock while the subflow socket lock is
still held and the MPTCP socket requires that the msk socket lock must
be acquired before the subflow socket lock.
Address the issue always invoking tcp_sendmsg_fastopen() in an
unblocking manner, and later eventually complete the blocking
__inet_stream_connect() as needed.
Fixes: d98a82a6af ("mptcp: handle defer connect in mptcp_sendmsg")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current MPTCP connect implementation duplicates a bit of inet
code and does not use nor provide a struct proto->connect callback,
which in turn will not fit the upcoming fastopen implementation.
Refactor such implementation to use the common helper, moving the
MPTCP-specific bits into mptcp_connect(). Additionally, avoid an
indirect call to the subflow connect callback.
Note that the fastopen call-path invokes mptcp_connect() while already
holding the subflow socket lock. Explicitly keep track of such path
via a new MPTCP-level flag and handle the locking accordingly.
Additionally, track the connect flags in a new msk field to allow
propagating them to the subflow inet_stream_connect call.
Fixes: d98a82a6af ("mptcp: handle defer connect in mptcp_sendmsg")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mptcp_pm_nl_get_local_id() code assumes that the msk local address
is available at that point. For passive sockets, we initialize such
address at accept() time.
Depending on the running configuration and the user-space timing, a
passive MPJ subflow can join the msk socket before accept() completes.
In such case, the PM assigns a wrong local id to the MPJ subflow
and later PM netlink operations will end-up touching the wrong/unexpected
subflow.
All the above causes sporadic self-tests failures, especially when
the host is heavy loaded.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/308
Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Fixes: d045b9eb95 ("mptcp: introduce implicit endpoints")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: sfp: improve high power module implementation
This series aims to improve the power level switching between standard
level 1 and the higher power levels.
The first patch updates the DT binding documentation to include the
minimum and default of 1W, which is the base level that every SFP cage
must support. Hence, it makes sense to document this in the binding.
The second patch enforces a minimum of 1W when parsing the firmware
description, and optimises the code for that case; there's no need to
check for SFF8472 compliance since we will not need to touch the
A2h registers.
Patch 3 validates that the module supports SFF-8472 rev 10.2 before
checking for power level 2 - rev 10.2 is where support for power
levels was introduced, so if the module doesn't support this revision,
it doesn't support power levels. Setting the power level 2 declaration
bit is likely to be spurious.
Patch 4 does the same for power level 3, except this was introduced in
SFF-8472 rev 11.9. The revision code was never updated, so we use the
rev 11.4 to signify this.
Patch 5 cleans up the code - rather than using BIT(0), we now use a
properly named value for the power level select bit.
Patch 6 introduces a read-modify-write helper.
Patch 7 gets rid of the DM7052 hack (which sets a power level
declaration bit but is not compatible with SFF-8472 rev 10.2, and
the module does not implement the A2h I2C address.)
Series tested with my DM7052.
v2: update sff.sfp.yaml with Rob's feedback
====================
Andrew's review tags from v1.
Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since we no longer mis-detect high-power mode with the DM7052 module,
we no longer need the hack in sfp_module_enable_high_power(), and can
now switch this to use sfp_modify_u8().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper to modify bits in a single byte in memory space, and use
it when updating the soft tx-disable flag in the module.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide a named definition for the power level select bit in the
extended status register, rather than using BIT(0) in the code.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Power level 3 was included in SFF-8472 revision 11.9, but this does
not have a compliance code. Use revision 11.4 as the minimum
compliance level instead.
This should avoid any spurious indication of 2W modules.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
the power declaration bit for modules that are not compliant with
at least this revision.
This should remove any spurious indication of 1.5W modules.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Check that the firmware provided maximum power is at least 1W, which
is the minimum power level for any SFP module.
Now that we enforce the minimum of 1W, we can exit early from
sfp_module_parse_power() if the module power is 1W or less.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a minimum and default for the maximum-power-milliwatt option;
module power levels were originally up to 1W, so this is the default
and the minimum power level we can have for a functional SFP cage.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a frame is sent using FDMA, the skb is mapped and then the mapped
address is given to an tx dcb that is different than the last used tx
dcb. Once the HW finish with this frame, it would generate an interrupt
and then the dcb can be reused and memory can be freed. For each dcb
there is an dcb buf that contains some meta-data(is used by PTP, is
it free). There is 1 to 1 relationship between dcb and dcb_buf.
The following issue was observed. That sometimes after changing the MTU
to allocate new tx dcbs and dcbs_buf, two frames were not
transmitted. The frames were not transmitted because when reloading the
tx dcbs, it was always presuming to use the first dcb but that was not
always happening. Because it could be that the last tx dcb used before
changing MTU was first dcb and then when it tried to get the next dcb it
would take dcb 1 instead of 0. Because it is supposed to take a
different dcb than the last used one. This can be fixed simply by
changing tx->last_in_use to -1 when the fdma is disabled to reload the
new dcb and dcbs_buff.
But there could be a different issue. For example, right after the frame
is sent, the MTU is changed. Now all the dcbs and dcbs_buf will be
cleared. And now get the interrupt from HW that it finished with the
frame. So when we try to clear the skb, it is not possible because we
lost all the dcbs_buf.
The solution here is to stop replacing the tx dcbs and dcbs_buf when
changing MTU because the TX doesn't care what is the MTU size, it is
only the RX that needs this information.
Fixes: 2ea1cbac26 ("net: lan966x: Update FDMA to change MTU.")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20221021090711.3749009-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan says:
====================
bnxt_en: Driver updates
This patchset adds .get_module_eeprom_by_page() support and adds
an NVRAM resize step to allow larger firmware images to be flashed
to older firmware.
====================
Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Resize of the UPDATE entry is required if the image to
be flashed is larger than the available space. Add this step,
otherwise flashing larger firmware images by ethtool or devlink
may fail.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for .get_module_eeprom_by_page() callback which
implements generic solution for module`s eeprom access.
v3: Add bnxt_get_module_status() to get a more specific extack error
string.
Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
don't want to fallback to old method.
v2: Simplification suggested by Ido Schimmel
Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The main changes are PTM timestamp support, CMIS EEPROM support, and
asymmetric CoS queues support.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To keep backward compatibility we used to leave attribute parsing
to the family if no policy is specified. This becomes tedious as
we move to more strict validation. Families must define reject all
policies if they don't want any attributes accepted.
Piggy back on the resv_start_op field as the switchover point.
AFAICT only ethtool has added new commands since the resv_start_op
was defined, and it has per-op policies so this should be a no-op.
Nonetheless the patch should still go into v6.1 for consistency.
Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/
Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding on the kernel command line "ftrace=function" triggered:
CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range: 0xffffffffc0013000 - 0xffffffffc0013fff PFN 10031b
WARNING: CPU: 0 PID: 0 at arch/x86/mm/pat/set_memory.c:609
verify_rwx+0x61/0x6d
Call Trace:
__change_page_attr_set_clr+0x146/0x8a6
change_page_attr_set_clr+0x135/0x268
change_page_attr_clear.constprop.0+0x16/0x1c
set_memory_x+0x2c/0x32
arch_ftrace_update_trampoline+0x218/0x2db
ftrace_update_trampoline+0x16/0xa1
__register_ftrace_function+0x93/0xb2
ftrace_startup+0x21/0xf0
register_ftrace_function_nolock+0x26/0x40
register_ftrace_function+0x4e/0x143
function_trace_init+0x7d/0xc3
tracer_init+0x23/0x2c
tracing_set_tracer+0x1d5/0x206
register_tracer+0x1c0/0x1e4
init_function_trace+0x90/0x96
early_trace_init+0x25c/0x352
start_kernel+0x424/0x6e4
x86_64_start_reservations+0x24/0x2a
x86_64_start_kernel+0x8c/0x95
secondary_startup_64_no_verify+0xe0/0xeb
This is because at boot up, kernel text is writable, and there's no
reason to do tricks to updated it. But the verifier does not
distinguish updates at boot up and at run time, and causes a warning at
time of boot.
Add a check for system_state == SYSTEM_BOOTING and allow it if that is
the case.
[ These SYSTEM_BOOTING special cases are all pretty horrid, but the x86
text_poke() code does some odd things at bootup, forcing this for now
- Linus ]
Link: https://lore.kernel.org/r/20221024112730.180916b3@gandalf.local.home
Fixes: 652c5bf380 ("x86/mm: Refuse W^X violations")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNW024ACgkQMUZtbf5S
IrvX7w//SP/zKZwgzC13zd2rrCP16TX2QvkHPmSLvcldQDXdCypmsoc5Vb8UNpkG
jwAuy2pxqPy2oxTwTBQv9TNRT2oqEFOsFTK+w410whlL7g1wZ02aXU8qFhV2XumW
o4gRtM+UISPUKFbOnawdK1XlrNdeLF3bjETvW2GP9zxCb0iqoQXtDDNKxv2B2iQA
MSyTtzHA4n9GS7LKGtPgsP2Ose7h1Z+AjTIpQH1nvfEHJUf/wmxUdCK+fuwfeLjY
PhmYaPG/333j1bfBk1Ms/nUYA5KRXlEj9A/7jDtxhxNEwaTNKyLB19a6oVxXxpSQ
x/k+nZP1RColn5xeco5a1X9aHHQ46PJQ8wVAmxYDIeIA5XPMgShNmhAyjrq1ac+o
9vYeYpmnMGSTLdBMvGbWpynWHe7SddgF8LkbnYf2HLKbxe4bgkOnmxOUH4q9iinZ
MfVSknjax4DP0C7X1kGgR6WyltWnkrahOdUkINsIUNxj0KxJa/eStpJIbJrfkxNV
gHbOjB2/bF3SXENrS4A0IJCgsbO9YugN83Eyu0WDWQOw9wVgopzxOJx9R+H0wkVH
XpGGP8qi1DZiTE3iQiq1LHj6f6kirFmtt9QFH5yzaqtKBaqXakHaXwUO4VtD+BI9
NPFKvFL6jrp8EAn0PTM/RrvhJZN+V0bFXiyiMe0TLx+aR0UMxGc=
=dD6N
-----END PGP SIGNATURE-----
Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.
The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
I'm biased.
Current release - regressions:
- eth: fman: re-expose location of the MAC address to userspace,
apparently some udev scripts depended on the exact value
Current release - new code bugs:
- bpf:
- wait for busy refill_work when destroying bpf memory allocator
- allow bpf_user_ringbuf_drain() callbacks to return 1
- fix dispatcher patchable function entry to 5 bytes nop
Previous releases - regressions:
- net-memcg: avoid stalls when under memory pressure
- tcp: fix indefinite deferral of RTO with SACK reneging
- tipc: fix a null-ptr-deref in tipc_topsrv_accept
- eth: macb: specify PHY PM management done by MAC
- tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Previous releases - always broken:
- eth: amd-xgbe: SFP fixes and compatibility improvements
Misc:
- docs: netdev: offer performance feedback to contributors"
* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
net-memcg: avoid stalls when under memory pressure
tcp: fix indefinite deferral of RTO with SACK reneging
tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
docs: netdev: offer performance feedback to contributors
kcm: annotate data-races around kcm->rx_wait
kcm: annotate data-races around kcm->rx_psock
net: fman: Use physical address for userspace interfaces
net/mlx5e: Cleanup MACsec uninitialization routine
atlantic: fix deadlock at aq_nic_stop
nfp: only clean `sp_indiff` when application firmware is unloaded
amd-xgbe: add the bit rate quirk for Molex cables
amd-xgbe: fix the SFP compliance codes check for DAC cables
amd-xgbe: enable PLL_CTL for fixed PHY modes only
amd-xgbe: use enums for mailbox cmd and sub_cmds
amd-xgbe: Yellow carp devices do not need rrc
bpf: Use __llist_del_all() whenever possbile during memory draining
bpf: Wait for busy refill_work when destroying bpf memory allocator
MAINTAINERS: add keyword match on PTP
...
This pull request contains a commit that fixes bf95b2bc3e ("rcu: Switch
polled grace-period APIs to ->gp_seq_polled"), which could incorrectly
leave interrupts enabled after an early-boot call to synchronize_rcu().
Such synchronize_rcu() calls must acquire leaf rcu_node locks in order to
properly interact with polled grace periods, but the code did not take
into account the possibility of synchronize_rcu() being invoked from
the portion of the boot sequence during which interrupts are disabled.
This commit therefore switches the lock acquisition and release from
irq to irqsave/irqrestore.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmNS54gTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jI5pD/0aURo25lH/gxCxt1A5gi/UI+/rlz+O
7rV51ZXW3cAPY1O7EP/57D8sXtgcALxYyRZ2JVh1WH+ScP+Se9YENs+UBp5zwy75
NIOkJ2LRxMCeX+T7Vw8jtsYd4bsgSQ1q2IXiGHCrRxtUwXewYMxLEVx4Bd89P45F
P49NJx66s5uXmKH2VV6UUcJT7yfyn8+USYxTaDmhhnEGE1frKBoYVqsqoaVfa0ON
r8O50/06bj6VdSRooGGw9vUcuoUlsrmnnXDd2aITkWqtBkeBOA6S3Nx+LUiRRXaw
m9e0s0yLulTlMsXVx/UAM+eBaXP3SDKK7wF1xXoyCqLdPKtZ4ABM0RQzXMsFBQQy
xs04Ba/0CBe1wVv9BijXuR1WgmcRUtoqxAFKXR6bIwh7uPtFDINRO2hfi9vEEVSQ
+4XbnoqsyHaul8tvBWV7O1Fo7+hNgFgJwAZq9I2xOJzh8wbVCc1+5E13K0Oktbq4
HERDAzeqA8Cr+6VxicrXt8/5xw7GsUJbYoXMpttB1WQBIps3/ayUxww1RyK9qegi
DnpUSAUdd0bF1ZfVtkh7V59uk+BBsL9MKNowTRrT1nVWO2VcAbf9IVvONHBFlIw4
d2IKn4IbpGsQR9ZJUfo/6y7GkZYBLU46lknduW+n7vuP+7j2B3KkFQw7z71e1c7u
LSEgmtPZS/c7SA==
=o6hI
-----END PGP SIGNATURE-----
Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"Fix a regression caused by commit bf95b2bc3e ("rcu: Switch polled
grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
interrupts enabled after an early-boot call to synchronize_rcu().
Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
to properly interact with polled grace periods, but the code did not
take into account the possibility of synchronize_rcu() being invoked
from the portion of the boot sequence during which interrupts are
disabled.
This commit therefore switches the lock acquisition and release from
irq to irqsave/irqrestore"
* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Keep synchronize_rcu() from enabling irqs in early boot
This KUnit fixes update for Linux 6.1-rc3 consists of one single fix
to update alloc_string_stream() callers to check for IS_ERR() instead
of NULL to be in sync with alloc_string_stream() returning IS_ERR().
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmNWwFwACgkQCwJExA0N
QxzH8Q/+PdNr/qmF3eNWiz5u/UC+jBQOiNURb+DenFkS9i6iMp61HBSUTDFakR//
NsPiOg4G8RUT7XI9kvhb22vC5b+RacHTwHrxh1TBd83j3WW2xsvCEOF228yiRM3m
8XM2pg+oLSkmcPmt9ym2hgiZBQ5rvzkyWmZKSZy+gzW9vylCA8BXDoVlTNbsiemh
JbxFyPaq3ApSkNcLBPuKYWpz1PungGg6RybJ2ZlP0384Z7JJtpdpom/a1fEXTABi
d4DCkvKntI6xyVkv62t3F2WVKnV78VJJX87RcwDV4qZ5W8V2MXXZwuPPrbeyC6tB
TLkh4M1EdfSfr6zxkVOiPz8noCmiF9TzJauwjPV/tyOnDT8dDSpcty2iAgjyIdFg
PSS1rNeFo6EkeouETkX+h8TI3VzzNTFBrWiAcqOlr1i5R2Pt20IqAQcUdhQVFg0C
Y3BjycG83f7k2uYQivUSLSv3DYbrDIov29Ej9P4sZ9MZSO/DbkkhNb87A7IQQCI5
zRKRh8KZWUzjAHJge5M9a5wZdhnECpwoJNiBlhKxa1/TmrIS3Vc0KthC1URrgS4E
IK/T3HGqlMlkV3Zi59R/bIBFCfz7Y6OJ0CjPt/Ey+xUpm8uUYwU4PgDfPbVWglzf
5KMEkEXmEWqZJviMbs+YF+FlqaIz692WTEYMQ/obrz2cvchPsUc=
=rHRJ
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit fixes from Shuah Khan:
"One single fix to update alloc_string_stream() callers to check for
IS_ERR() instead of NULL to be in sync with alloc_string_stream()
returning an ERR_PTR()"
* tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: update NULL vs IS_ERR() tests
- Fix typos in UART1 and MMC in the Ingenic driver.
- A really well researched glitch bug fix to the Qualcomm driver
that was tracked down and fixed by Dough Anderson from
Chromium. Hats off for this one!
- Revert two patches on the Xilinx ZynqMP driver: this needs a
proper solution making use of firmware version information to
adapt to different firmware releases.
- Fix interrupt triggers in the Ocelot driver.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmNWPCQACgkQQRCzN7AZ
XXOcbRAAyyqqUTiHc3v6OiU12WyNxTay3Fypq4hl2UKsi6BN+NIzliavDdSKYJ+y
dR5h9YnXZ766OShGPoBgXAusY5Apx09IJkFePsc11IZcZwbsBRSmAwKkD9LHxb93
AIVHyn6zA1Ic2oD8ysgCC7kMAXUWibiSLgEqj3RSuwbD9lN2pYuUqZUK/Q7ydHpC
2yPRTlJig/9Ai79NlbFQXz8QUXKAR/niPPtaVtYEii+M87643kCHxKop52oWmF8V
KV9WTxFqtW7TgNqunrBn0JJjxU/k0Dlbecj4Y6gDszg9V+7sR0u+LpZ/KVFakOI0
P8FcmQquAOG+jApGoe8XMwQw0xYuSTJxKZ3sBHzkj5k3f//MzB47UUMh34wbwo/N
nt2lIzyV/Jlbhhj/1NjRMqECS00Ap+bo5BmDoyNFsPpQFqFfuyEmT27Mq/hJnmfm
i896lGkvJvnNAofBzw0/QiB65iAhNt6xhy2L0VqyNRAeFfQ0K9ltf3wipg0g3KH4
rq0W2e7Fepl/vE+2qSyViDXbPG5GgQG3ljv/DE+8DdvMWYLLc2zMz0FenYj5k2bv
XW9NtvEPRxVMCzTW5WEcDL21UQF7F5Vu8yojrjlv5XZ/QWafPjm/xpKhNBhNXuYf
hmR4hL2k4YCBRqWmhn9B3S+7jotuTRF4E5CUIPkXIOjUWXKJE1Y=
=w7Lm
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix typos in UART1 and MMC in the Ingenic driver
- A really well researched glitch bug fix to the Qualcomm driver that
was tracked down and fixed by Dough Anderson from Chromium. Hats off
for this one!
- Revert two patches on the Xilinx ZynqMP driver: this needs a proper
solution making use of firmware version information to adapt to
different firmware releases
- Fix interrupt triggers in the Ocelot driver
* tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: ocelot: Fix incorrect trigger of the interrupt.
Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
pinctrl: qcom: Avoid glitching lines when we first mux to output
pinctrl: Ingenic: JZ4755 bug fixes
As Shakeel explains the commit under Fixes had the unintended
side-effect of no longer pre-loading the cached memory allowance.
Even tho we previously dropped the first packet received when
over memory limit - the consecutive ones would get thru by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per one drop.
After the change we no longer force charge, there will be no
cache filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.
Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worst latency).
Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.
Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9f ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.
The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.
This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.
Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.
Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmNVkYkACgkQ6rmadz2v
bTqzHw/+NYMwfLm5Ck+BK0+HiYU5VVLoG4jp8G7B3sJL/6nUDduajzoqa+nM19Xl
+HEjbMza7CizmhkCRkzIs1VVtx8mtvKdTxbhvm77SU2+GBn+X1es+XhtFd4EOpok
MINNHs+cOC/HlnPD/QbFgvxKiKkjyjWxInjUp6Y/mLMcKCn7l9KOkc07/la9Dj3j
RI0gXCywq1pJaPuTCnt0/wcYLJvzn6QsZnKmmksQwt59GQqOd11HWid3rBWZhDp6
beEoHDIMGHROtu60vm4DB0p4l6tauZfeXyPCeu3Tx5ZSsypJIyU1iTdKiIUjG963
ilpy55nrX9bWxadB7LIKHyYfW3in4o+D1ZZaUvLIato/69CZJZ7Uc4kU1RF4Ay1F
Df1Fmal2WeNAxxETPmQPvVeCePvQvwLHl4KNogdZZvd/67cyc1cDhnuTJp37iPak
FALHaaw0VOrTdTvxsWym7yEbkhPbCHpPrKYFZFHgGrRTFk/GM2k38mM07lcLxFGw
aKyooS+eoIZMEgtK5Hma2wpeIVSlkJiJk1d0K20OxdnIUyYEsMXmI+uV1gMxq/8z
EHNi0+296xOoxy22I1Bd5Tu7fIeecHFN44q7YFmpGsB54UNLpFsP0vYUmYT/6hLI
Y0KVZu4c3oQDX7ttifMvkeOCURDJBPrZx37bpNpNXF55fB5ehNk=
=eV7W
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:
====================
pull-request: bpf 2022-10-23
We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).
The main changes are:
1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.
2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.
3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.
4) Prevent decl_tag from being referenced in func_proto, from Stanislav.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use __llist_del_all() whenever possbile during memory draining
bpf: Wait for busy refill_work when destroying bpf memory allocator
bpf: Fix dispatcher patchable function entry to 5 bytes nop
bpf: prevent decl_tag from being referenced in func_proto
selftests/bpf: Add reproducer for decl_tag in func_proto return type
selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================
Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chromebooks don't have backlight in ACPI table, they suppose to use
native backlight in this case. Check presence of the CrOS embedded
controller ACPI device and prefer the native backlight if EC found.
Suggested-by: Hans de Goede <hdegoede@redhat.com>
Fixes: 2600bfa3df ("ACPI: video: Add acpi_video_backlight_use_native() helper")
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20221024141210.67784-1-dmitry.osipenko@collabora.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Vadim Fedorenko says:
====================
ptp: ocp: add support for Orolia ART-CARD
Orolia company created alternative open source TimeCard. The hardware of
the card provides similar to OCP's card functions, that's why the support
is added to current driver.
The first patch in the series changes the way to store information about
serial ports and is more like preparation.
The patches 2 to 4 introduces actual hardware support.
The last patch removes fallback from devlink flashing interface to protect
against flashing wrong image. This became actual now as we have 2 different
boards supported and wrong image can ruin hardware easily.
v2:
Address comments from Jonathan Lemon
v3:
Fix issue reported by kernel test robot <lkp@intel.com>
v4:
Fix clang build issue
v5:
Fix warnings and per-patch build errors
v6:
Fix more style issues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously there was a fallback mode to flash firmware image without
proper header. But now we have different supported vendors and flashing
wrong image could destroy the hardware. Remove fallback mode and force
header check. Both vendors have published firmware images with headers.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orolia card has disciplining configuration and temperature table
stored in EEPROM. This patch exposes them as binary attributes to
have read and write access.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ART card provides interface to access to serial port of miniature atomic
clock found on the card. Add support for this device and configure it
during init phase.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This brings in the Orolia timecard support from the GitHub repository.
The card uses different drivers to provide access to i2c EEPROM and
firmware SPI flash. And it also has a bit different EEPROM map, but
other parts of the code are the same and could be reused.
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce structure to hold serial port line number and the baud rate
it supports.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
in tcp_add_backlog(), the variable limit is caculated by adding
sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value
of int and overflow. This patch reduces the limit budget by
halving the sndbuf to solve this issue since ACK packets are much
smaller than the payload.
Fixes: c9c3321257 ("tcp: add tcp_add_backlog()")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_pp_recycle() is only used by skb_free_head() in
skbuff.c, so move it to skbuff.c.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ndo_start_xmit() method must not free skb when returning
NETDEV_TX_BUSY, since caller is going to requeue freed skb.
Fixes: 504d4721ee ("MIPS: Lantiq: Add ethernet driver")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_stop_all_queues must be called before calling H_FREE_LOGICAL_LAN.
As a result, we can remove the pool_config field from the ibmveth
adapter structure.
Some device configuration changes call ibmveth_close in order to free
the current resources held by the device. These functions then make
their changes and call ibmveth_open to reallocate and reserve resources
for the device.
Prior to this commit, the flag pool_config was used to tell ibmveth_close
that it should not halt the transmit queue. pool_config was introduced in
commit 860f242eb5 ("[PATCH] ibmveth change buffer pools dynamically")
to avoid interrupting the tx flow when making rx config changes. Since
then, other commits adopted this approach, even if making tx config
changes.
The issue with this approach was that the hypervisor freed all of
the devices control structures after the hcall H_FREE_LOGICAL_LAN
was performed but the transmit queues were never stopped. So the higher
layers in the network stack would continue transmission but any
H_SEND_LOGICAL_LAN hcall would fail with H_PARAMETER until the
hypervisor's structures for the device were allocated with the
H_REGISTER_LOGICAL_LAN hcall in ibmveth_open. This resulted in
no real networking harm but did cause several of these error
messages to be logged: "h_send_logical_lan failed with rc=-4"
So, instead of trying to keep the transmit queues alive during network
configuration changes, just stop the queues, make necessary changes then
restart the queues.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it
safely.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support for configuring periodic output
signal of PPS. So the PPS can be output at a specified time
and period.
For developers or testers, they can use the command "echo
<channel> <start.sec> <start.nsec> <period.sec> <period.
nsec> > /sys/class/ptp/ptp0/period" to specify time and
period to output PPS signal.
Notice that, the channel can only be set to 0. In addtion,
the start time must larger than the current PTP clock time.
So users can use the command "phc_ctl /dev/ptp0 -- get" to
get the current PTP clock time before.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ops_init() interface is invoked to initialize the net, but
ops->init() fails, data is released. However, the ptr pointer in
net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked
to release the net, invalid address access occurs.
The process is as follows:
setup_net()
ops_init()
data = kzalloc(...) ---> alloc "data"
net_assign_generic() ---> assign "date" to ptr in net->gen
...
ops->init() ---> failed
...
kfree(data); ---> ptr in net->gen is invalid
...
ops_exit_list()
...
nfqnl_nf_hook_drop()
*q = nfnl_queue_pernet(net) ---> q is invalid
The following is the Call Trace information:
BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
Read of size 8 at addr ffff88810396b240 by task ip/15855
Call Trace:
<TASK>
dump_stack_lvl+0x8e/0xd1
print_report+0x155/0x454
kasan_report+0xba/0x1f0
nfqnl_nf_hook_drop+0x264/0x280
nf_queue_nf_hook_drop+0x8b/0x1b0
__nf_unregister_net_hook+0x1ae/0x5a0
nf_unregister_net_hooks+0xde/0x130
ops_exit_list+0xb0/0x170
setup_net+0x7ac/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Allocated by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0xa1/0xb0
__kmalloc+0x49/0xb0
ops_init+0xe7/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x155/0x1b0
slab_free_freelist_hook+0x11b/0x220
__kmem_cache_free+0xa4/0x360
ops_init+0xb9/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes: f875bae065 ("net: Automatically allocate per namespace data.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ffa84b5ffb ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.
We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.
This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.
Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().
This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of us gotten used to producing large quantities of peer feedback
at work, every 3 or 6 months. Extending the same courtesy to community
members seems like a logical step. It may be hard for some folks to
get validation of how important their work is internally, especially
at smaller companies which don't employ many kernel experts.
The concept of "peer feedback" may be a hyperscaler / silicon valley
thing so YMMV. Hopefully we can build more context as we go.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
kcm: annotate data-races
This series address two different syzbot reports for KCM.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
kcm->rx_psock can be read locklessly in kcm_rfree().
Annotate the read and writes accordingly.
We do the same for kcm->rx_wait in the following patch.
syzbot reported:
BUG: KCSAN: data-race in kcm_rfree / unreserve_rx_kcm
write to 0xffff888123d827b8 of 8 bytes by task 2758 on cpu 1:
unreserve_rx_kcm+0x72/0x1f0 net/kcm/kcmsock.c:313
kcm_rcv_strparser+0x2b5/0x3a0 net/kcm/kcmsock.c:373
__strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
strp_recv+0x6d/0x80 net/strparser/strparser.c:335
tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
strp_read_sock net/strparser/strparser.c:358 [inline]
do_strp_work net/strparser/strparser.c:406 [inline]
strp_work+0xe8/0x180 net/strparser/strparser.c:415
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
read to 0xffff888123d827b8 of 8 bytes by task 5859 on cpu 0:
kcm_rfree+0x14c/0x220 net/kcm/kcmsock.c:181
skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
skb_release_all net/core/skbuff.c:852 [inline]
__kfree_skb net/core/skbuff.c:868 [inline]
kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
kfree_skb include/linux/skbuff.h:1216 [inline]
kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
____sys_recvmsg+0x16c/0x2e0
___sys_recvmsg net/socket.c:2743 [inline]
do_recvmmsg+0x2f1/0x710 net/socket.c:2837
__sys_recvmmsg net/socket.c:2916 [inline]
__do_sys_recvmmsg net/socket.c:2939 [inline]
__se_sys_recvmmsg net/socket.c:2932 [inline]
__x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffff88812971ce00 -> 0x0000000000000000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 5859 Comm: syz-executor.3 Not tainted 6.0.0-syzkaller-12189-g19d17ab7c68b-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
udp: avoid false sharing on receive
Under high UDP load, the BH processing and the user-space receiver can
run on different cores.
The UDP implementation does a lot of effort to avoid false sharing in
the receive path, but recent changes to the struct sock layout moved
the sk_forward_alloc and the sk_rcvbuf fields on the same cacheline:
/* --- cacheline 4 boundary (256 bytes) --- */
struct sk_buff * tail;
} sk_backlog;
int sk_forward_alloc;
unsigned int sk_reserved_mem;
unsigned int sk_ll_usec;
unsigned int sk_napi_id;
int sk_rcvbuf;
sk_forward_alloc is updated by the BH, while sk_rcvbuf is accessed by
udp_recvmsg(), causing false sharing.
A possible solution would be to re-order the struct sock fields to avoid
the false sharing. Such change is subject to being invalidated by future
changes and could have negative side effects on other workload.
Instead this series uses a different approach, touching only the UDP
socket layout.
The first patch generalizes the custom setsockopt infrastructure, to
allow UDP tracking the buffer size, and the second patch addresses the
issue, copying the relevant buffer information into an already hot
cacheline.
Overall the above gives a 10% peek throughput increase under UDP flood.
v1 -> v2:
- introduce and use a common helper to initialize the UDP v4/v6 sockets
(Kuniyuki)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When the receiver process and the BH runs on different cores,
udp_rmem_release() experience a cache miss while accessing sk_rcvbuf,
as the latter shares the same cacheline with sk_forward_alloc, written
by the BH.
With this patch, UDP tracks the rcvbuf value and its update via custom
SOL_SOCKET socket options, and copies the forward memory threshold value
used by udp_rmem_release() in a different cacheline, already accessed by
the above function and uncontended.
Since the UDP socket init operation grown a bit, factor out the common
code between v4 and v6 in a shared helper.
Overall the above give a 10% peek throughput increase under UDP flood.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon introduce custom setsockopt for UDP sockets, too.
Instead of doing even more complex arbitrary checks inside
sock_use_custom_sol_socket(), add a new socket flag and set it
for the relevant socket types (currently only MPTCP).
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>