Our current route lookups (mctp_route_lookup and mctp_route_lookup_null)
traverse the net's route list without the RCU read lock held. This means
the route lookup is subject to preemption, resulting in an potential
grace period expiry, and so an eventual kfree() while we still have the
route pointer.
Add the proper read-side critical section locks around the route
lookups, preventing premption and a possible parallel kfree.
The remaining net->mctp.routes accesses are already under a
rcu_read_lock, or protected by the RTNL for updates.
Based on an analysis from Sili Luo <rootlab@huawei.com>, where
introducing a delay in the route lookup could cause a UAF on
simultaneous sendmsg() and route deletion.
Reported-by: Sili Luo <rootlab@huawei.com>
Fixes: 889b7da23a ("mctp: Add initial routing framework")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/29c4b0e67dc1bf3571df3982de87df90cae9b631.1696837310.git.jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1e66220948 ("net/mlx5e: Update rx ring hw mtu upon each rx-fcs
flag change") seems to have accidentally inverted the logic added in
commit 0bc73ad46a ("net/mlx5e: Mutually exclude RX-FCS and
RX-port-timestamp").
The impact of this is a little unclear since it seems the FCS scattered
with RX-FCS is (usually?) correct regardless.
Fixes: 1e66220948 ("net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change")
Tested-by: Charlotte Tan <charlotte@extrahop.com>
Reviewed-by: Charlotte Tan <charlotte@extrahop.com>
Cc: Adham Faris <afaris@nvidia.com>
Cc: Aya Levin <ayal@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Moshe Shemesh <moshe@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Will Mortensen <will@extrahop.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20231006053706.514618-1-will@extrahop.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When the SMC protocol is built into the kernel proper while ISM is
configured to be built as module, linking the kernel fails due to
unresolved dependencies out of net/smc/smc_ism.o to
ism_get_smcd_ops, ism_register_client, and ism_unregister_client
as reported via the linux-next test automation (see link).
This however is a bug introduced a while ago.
Correct the dependency list in ISM's and SMC's Kconfig to reflect the
dependencies that are actually inverted. With this you cannot build a
kernel with CONFIG_SMC=y and CONFIG_ISM=m. Either ISM needs to be 'y',
too - or a 'n'. That way, SMC can still be configured on non-s390
architectures that do not have (nor need) an ISM driver.
Fixes: 89e7d2ba61 ("net/ism: Add new API for client registration")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/linux-next/d53b5b50-d894-4df8-8969-fd39e63440ae@infradead.org/
Co-developed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231006125847.1517840-1-gbayer@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The adapter->vf_mvs.l list needs to be initialized even if the list is
empty. Otherwise it will lead to crashes.
Fixes: a1cbb15c13 ("ixgbe: Add macvlan support for VF")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/ZSADNdIw8zFx1xw2@kadam
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When updating the SA, use the new update_pn flags instead of comparing the
new PN with the initial one.
Comparing the initial PN value with the new value will allow the user
to update the SA using the initial PN value as a parameter like this:
$ ip macsec add macsec0 tx sa 0 pn 1 on key 00 \
ead3664f508eb06c40ac7104cdae4ce5
$ ip macsec set macsec0 tx sa 0 pn 1 off
Fixes: 8ff0ac5be1 ("net/mlx5: Add MACsec offload Tx command support")
Fixes: aae3454e4d ("net/mlx5e: Add MACsec offload Rx command support")
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Updating the PN is not supported.
Return -EINVAL if update_pn is true.
The following command succeeded, but it should fail because the driver
does not update the PN:
ip macsec set macsec0 tx sa 0 pn 232 on
Fixes: 28c5107aa9 ("net: phy: mscc: macsec support")
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When updating SA, update the PN only when the update_pn flag is true.
Otherwise, the PN will be reset to its previous value using the
following command and this should not happen:
$ ip macsec set macsec0 tx sa 0 on
Fixes: c54ffc7360 ("octeontx2-pf: mcs: Introduce MACSEC hardware offloading")
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Indicate next PN update using update_pn flag in macsec_context.
Offloaded MACsec implementations does not know whether or not the
MACSEC_SA_ATTR_PN attribute was passed for an SA update and assume
that next PN should always updated, but this is not always true.
The PN can be reset to its initial value using the following command:
$ ip macsec set macsec0 tx sa 0 off #octeontx2-pf case
Or, the update PN command will succeed even if the driver does not support
PN updates.
$ ip macsec set macsec0 tx sa 0 pn 1 on #mscc phy driver case
Comparing the initial PN with the new PN value is not a solution. When
the user updates the PN using its initial value the command will
succeed, even if the driver does not support it. Like this:
$ ip macsec add macsec0 tx sa 0 pn 1 on key 00 \
ead3664f508eb06c40ac7104cdae4ce5
$ ip macsec set macsec0 tx sa 0 pn 1 on #mlx5 case
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
syzbot uses panic_on_warn.
This means that the skb_dump() I added in the blamed commit are
not even called.
Rewrite this so that we get the needed skb dump before syzbot crashes.
Fixes: eeee4b77dc ("net: add more debug info in skb_checksum_help()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231006173355.2254983-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When one of the LAG interfaces is in switchdev mode, setting default rule
can't be done.
The interface on which switchdev is running has ice_set_rx_mode() blocked
to avoid default rule adding (and other rules). The other interfaces
(without switchdev running but connected via bond with interface that
runs switchdev) can't follow the same scheme, because rx filtering needs
to be disabled when failover happens. Notification for bridge to set
promisc mode seems like good place to do that.
Fixes: bb52f42ace ("ice: Add driver support for firmware changes for LAG")
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not set netback interfaces (vifs) default TX queue size to the ring size.
The TX queue size is not related to the ring size, and using the ring size (32)
as the queue size can lead to packet drops. Note the TX side of the vif
interface in the netback domain is the one receiving packets to be injected
to the guest.
Do not explicitly set the TX queue length to any value when creating the
interface, and instead use the system default. Note that the queue length can
also be adjusted at runtime.
Fixes: f942dc2552 ('xen network backend driver')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlxsw_sp2_nve_vxlan_learning_set() function is supposed to return
zero on success or negative error codes. So it needs to be type int
instead of bool.
Fixes: 4ee70efab6 ("mlxsw: spectrum_nve: Add support for VXLAN on Spectrum-2")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ravb_stop() should call cancel_work_sync(). Otherwise,
ravb_tx_timeout_work() is possible to use the freed priv after
ravb_remove() was called like below:
CPU0 CPU1
ravb_tx_timeout()
ravb_remove()
unregister_netdev()
free_netdev(ndev)
// free priv
ravb_tx_timeout_work()
// use priv
unregister_netdev() will call .ndo_stop() so that ravb_stop() is
called. And, after phy_stop() is called, netif_carrier_off()
is also called. So that .ndo_tx_timeout() will not be called
after phy_stop().
Fixes: c156633f13 ("Renesas Ethernet AVB driver proper")
Reported-by: Zheng Wang <zyytlz.wz@163.com>
Closes: https://lore.kernel.org/netdev/20230725030026.1664873-1-zyytlz.wz@163.com/
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20231005011201.14368-3-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In ravb_remove(), dma_free_coherent() should be call after
unregister_netdev(). Otherwise, this controller is possible to use
the freed buffer.
Fixes: c156633f13 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20231005011201.14368-2-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Devlink health dump get callback should take devlink lock as any other
devlink callback. Otherwise, since devlink_mutex was removed, this
callback is not protected from a race of the reporter being destroyed
while handling the callback.
Add devlink lock to the callback and to any call for
devlink_health_do_dump(). This should be safe as non of the drivers dump
callback implementation takes devlink lock.
As devlink lock is added to any callback of dump, the reporter dump_lock
is now redundant and can be removed.
Fixes: d3efc2a6a6 ("net: devlink: remove devlink_mutex")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/1696510216-189379-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit d61491a51f ("net/sched: cls_u32: Replace one-element array
with flexible-array member") incorrecly replaced an instance of
`sizeof(*tp_c)` with `struct_size(tp_c, hlist->ht, 1)`. This results
in a an over-allocation of 8 bytes.
This change is wrong because `hlist` in `struct tc_u_common` is a
pointer:
net/sched/cls_u32.c:
struct tc_u_common {
struct tc_u_hnode __rcu *hlist;
void *ptr;
int refcnt;
struct idr handle_idr;
struct hlist_node hnode;
long knodes;
};
So, the use of `struct_size()` makes no sense: we don't need to allocate
any extra space for a flexible-array member. `sizeof(*tp_c)` is just fine.
So, `struct_size(tp_c, hlist->ht, 1)` translates to:
sizeof(*tp_c) + sizeof(tp_c->hlist->ht) ==
sizeof(struct tc_u_common) + sizeof(struct tc_u_knode *) ==
144 + 8 == 0x98 (byes)
^^^
|
unnecessary extra
allocation size
$ pahole -C tc_u_common net/sched/cls_u32.o
struct tc_u_common {
struct tc_u_hnode * hlist; /* 0 8 */
void * ptr; /* 8 8 */
int refcnt; /* 16 4 */
/* XXX 4 bytes hole, try to pack */
struct idr handle_idr; /* 24 96 */
/* --- cacheline 1 boundary (64 bytes) was 56 bytes ago --- */
struct hlist_node hnode; /* 120 16 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
long int knodes; /* 136 8 */
/* size: 144, cachelines: 3, members: 6 */
/* sum members: 140, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
And with `sizeof(*tp_c)`, we have:
sizeof(*tp_c) == sizeof(struct tc_u_common) == 144 == 0x90 (bytes)
which is the correct and original allocation size.
Fix this issue by replacing `struct_size(tp_c, hlist->ht, 1)` with
`sizeof(*tp_c)`, and avoid allocating 8 too many bytes.
The following difference in binary output is expected and reflects the
desired change:
| net/sched/cls_u32.o
| @@ -6148,7 +6148,7 @@
| include/linux/slab.h:599
| 2cf5: mov 0x0(%rip),%rdi # 2cfc <u32_init+0xfc>
| 2cf8: R_X86_64_PC32 kmalloc_caches+0xc
|- 2cfc: mov $0x98,%edx
|+ 2cfc: mov $0x90,%edx
Reported-by: Alejandro Colomar <alx@kernel.org>
Closes: https://lore.kernel.org/lkml/09b4a2ce-da74-3a19-6961-67883f634d98@kernel.org/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún says:
====================
net: dsa: qca8k: fix qca8k driver for Turris 1.x
this is v2 of
https://lore.kernel.org/netdev/20231002104612.21898-1-kabel@kernel.org/
Changes since v1:
- fixed a typo in commit message noticed by Simon Horman
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides the QCA8337 switch the Turris 1.x device has on it's MDIO bus
also Micron ethernet PHY (dedicated to the WAN port).
We've been experiencing a strange behavior of the WAN ethernet
interface, wherein the WAN PHY started timing out the MDIO accesses, for
example when the interface was brought down and then back up.
Bisecting led to commit 2cd5485663 ("net: dsa: qca8k: add support for
phy read/write with mgmt Ethernet"), which added support to access the
QCA8337 switch's internal PHYs via management ethernet frames.
Connecting the MDIO bus pins onto an oscilloscope, I was able to see
that the MDIO bus was active whenever a request to read/write an
internal PHY register was done via an management ethernet frame.
My theory is that when the switch core always communicates with the
internal PHYs via the MDIO bus, even when externally we request the
access via ethernet. This MDIO bus is the same one via which the switch
and internal PHYs are accessible to the board, and the board may have
other devices connected on this bus. An ASCII illustration may give more
insight:
+---------+
+----| |
| | WAN PHY |
| +--| |
| | +---------+
| |
| | +----------------------------------+
| | | QCA8337 |
MDC | | | +-------+ |
------o-+--|--------o------------o--| | |
MDIO | | | | | PHY 1 |-|--to RJ45
--------o--|---o----+---------o--+--| | |
| | | | | +-------+ |
| +-------------+ | o--| | |
| | MDIO MDC | | | | PHY 2 |-|--to RJ45
eth1 | | | o--+--| | |
-----------|-|port0 | | | +-------+ |
| | | | o--| | |
| | switch core | | | | PHY 3 |-|--to RJ45
| +-------------+ o--+--| | |
| | | +-------+ |
| | o--| ... | |
+----------------------------------+
When we send a request to read an internal PHY register via an ethernet
management frame via eth1, the switch core receives the ethernet frame
on port 0 and then communicates with the internal PHY via MDIO. At this
time, other potential devices, such as the WAN PHY on Turris 1.x, cannot
use the MDIO bus, since it may cause a bus conflict.
Fix this issue by locking the MDIO bus even when we are accessing the
PHY registers via ethernet management frames.
Fixes: 2cd5485663 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c766e077d9 ("net: dsa: qca8k: convert to regmap read/write
API") introduced bulk read/write methods to qca8k's regmap.
The regmap bulk read/write methods get the register address in a buffer
passed as a void pointer parameter (the same buffer contains also the
read/written values). The register address occupies only as many bytes
as it requires at the beginning of this buffer. For example if the
.reg_bits member in regmap_config is 16 (as is the case for this
driver), the register address occupies only the first 2 bytes in this
buffer, so it can be cast to u16.
But the original commit implementing these bulk read/write methods cast
the buffer to u32:
u32 reg = *(u32 *)reg_buf & U16_MAX;
taking the first 4 bytes. This works on little endian systems where the
first 2 bytes of the buffer correspond to the low 16-bits, but it
obviously cannot work on big endian systems.
Fix this by casting the beginning of the buffer to u16 as
u32 reg = *(u16 *)reg_buf;
Fixes: c766e077d9 ("net: dsa: qca8k: convert to regmap read/write API")
Signed-off-by: Marek Behún <kabel@kernel.org>
Tested-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean says:
====================
Fixes for lynx-28g PHY driver
This series fixes some issues in the Lynx 28G SerDes driver, namely an
oops when unloading the module, a race between the periodic workqueue
and the PHY API, and a race between phy_set_mode_ext() calls on multiple
lanes on the same SerDes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The protocol converter configuration registers PCC8, PCCC, PCCD
(implemented by the driver), as well as others, control protocol
converters from multiple lanes (each represented as a different
struct phy). So, if there are simultaneous calls to phy_set_mode_ext()
to lanes sharing the same PCC register (either for the "old" or for the
"new" protocol), corruption of the values programmed to hardware is
possible, because lynx_28g_rmw() has no locking.
Add a spinlock in the struct lynx_28g_priv shared by all lanes, and take
the global spinlock from the phy_ops :: set_mode() implementation. There
are no other callers which modify PCC registers.
Fixes: 8f73b37cf3 ("phy: add support for the Layerscape SerDes 28G")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
lynx_28g_cdr_lock_check() runs once per second in a workqueue to reset
the lane receiver if the CDR has not locked onto bit transitions in the
RX stream. But the PHY consumer may do stuff with the PHY simultaneously,
and that isn't okay. Block concurrent generic PHY calls by holding the
PHY mutex from this workqueue.
Fixes: 8f73b37cf3 ("phy: add support for the Layerscape SerDes 28G")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The blamed commit added the CDR check work item but didn't cancel it on
the remove path. Fix this by adding a remove function which takes care
of it.
Fixes: 8f73b37cf3 ("phy: add support for the Layerscape SerDes 28G")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I didn't collect precise data but feels like we've got a lot
of 6.5 fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(),
it doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size,
fix struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use
of the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmUe/hkACgkQMUZtbf5S
Iru94hAAlkVsHgGy7T8Te23m5Q5v33ZRj+7hbjFBN+be2TJu8cWo9qL6jGvrxBP6
D0X32ZvoWtX95Ua023Ibs1WtEc6ebROctTuDZpoIW35MYamFz9fYecoY4i+t+m1+
6Y/UVRu4eHWXUtZclRQUEUdMe5HAJms9uNlIkTxLivvFhERgmGKtAsca8nuU9wMo
lJJUAFxf4wJKx/338AUaa0yfsg6cQcgRpbm1csAR9VSa3mU1PbrIYzf7fMbWjRET
6DiijheeClb7biF4ZKqqHgZcProwVFxoFCq1GH+PCzaw8K1eIkGwXlpVEW89lgsC
Ukc1L9JgLJIBSObvNWiO2mu5w+V89+XR2rL9KM3tW6x+k5tEncNL3WOphqH69NPS
gfizGlvXjP+2aCJgHovvlQEaBn/7xNYWTAbqh1jVDleUC95Ur8ap4X42YB/3QvPN
X9l8hp4Htu8SclqjXKtMz9qt6Ug5Uyi88o+1U53BNE6C6ICKW9i4uApT1DsLBAK8
QM5WPTj/ChIBbQu7HWNW+Ux3NX5R6fFzZ5BfKrjbuNEHQKRauj2300gVtU6xGS7T
IFWiu8i00T34aXF2Vnfykc0zNRylhw/DHqtJFUxmJQOBQgyKlkjYacf2cYru5lnR
BWA8Zsg5wpapT5CWSGlSRid4sRMtcDiMsI7fnIqB5CPhJGnR6eI=
=JAtc
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot of 6.5
fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size, fix
struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use of
the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
MAINTAINERS: update Matthieu's email address
mptcp: userspace pm allow creating id 0 subflow
mptcp: fix delegated action races
net: stmmac: remove unneeded stmmac_poll_controller
net: lan743x: also select PHYLIB
net: ethernet: mediatek: disable irq before schedule napi
net: mana: Fix oversized sge0 for GSO packets
net: mana: Fix the tso_bytes calculation
net: mana: Fix TX CQE error handling
netlink: annotate data-races around sk->sk_err
sctp: update hb timer immediately after users change hb_interval
sctp: update transport state when processing a dupcook packet
tcp: fix delayed ACKs for MSS boundary condition
tcp: fix quick-ack counting to count actual ACKs of new data
page_pool: fix documentation typos
tipc: fix a potential deadlock on &tx->lock
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
ipv4: Set offload_failed flag in fibmatch results
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
...
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCZR7WExQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5dkMAQCH4MaQ8m9wJwoMnXMUQ0JU5hiwHNeG
eA8lyW2cFhFthAD+JH2phkC5Ka5shBIOusjHRNml/8d/gVhmuXUQKJADSwk=
=ggtG
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity fixes from Mimi Zohar:
"Two additional patches to fix the removal of the deprecated
IMA_TRUSTED_KEYRING Kconfig"
* tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: rework CONFIG_IMA dependency block
ima: Finish deprecation of IMA_TRUSTED_KEYRING Kconfig
- Potential build failure in CS42L43
- Device Tree bindings clean-up for a superseded patch
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEdrbJNaO+IJqU8IdIUa+KL4f8d2EFAmUe1V0ACgkQUa+KL4f8
d2FhOA/9G0L0+sCd5kYKInfNcZklY/R9XKNoX5g5DQA8vyVllWIycbPL6VlPli0d
IRyw7uQLdWba4mlNWXql5pU7wGWgBqr268VwFhCGK6/vbJLpMFYEWGmOq306x6wz
poSrKmAKJeFswb0GSEyLeDtt/3c0k9bYV+ZUxavNTwoYmLaApP0NKQmNuH0mptA6
pZ825PCEWbNS++FI5lv+dPfOzWDlh691n8ea41HRD/svGDdNCS+EpTLcX9L+81P+
EIx+GWtW7CUGNYP2jJhMU7WW4/+3Uxdyrojmx+t9vkiD6t6sZkclDBxMzf7x+vLt
NtcWpUdeHR0phAFTUxR74w7vIgfiHb2HWNdpmzNB7ZWuGsClQDIVU8KL2JtvDZ6X
hcByEKnz7QwOXaZo5qwfWmmc5rVQkZG5CKrVN23/DVHlseyFShmllI4yw7fjfAGn
JWAF5IwrfZFgVPAeTmmZwmuNFx4pZiPj4ajBLmEvF5OxuHsaNjE1W7WOZnITwRdn
22EL9Q/DRsdzDci4eW7AHDwqwMJhUW7HIVMWEOMknuS/aEdrtJhNB2X8TDC9/zBi
p50K+mxGs/RB/uvHDDCce/fVgDpTDhuovAJRncigGZmDvWgXHMR7+UIgNW9opHwl
i8833UANGOEfBi68quVcAr7EI3CnTsMtr0b/JTibmTcrV0vZb10=
=g4wJ
-----END PGP SIGNATURE-----
Merge tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD fixes from Lee Jones:
"A couple of small fixes:
- Potential build failure in CS42L43
- Device Tree bindings clean-up for a superseded patch"
* tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
dt-bindings: mfd: Revert "dt-bindings: mfd: maxim,max77693: Add USB connector"
mfd: cs42l43: Fix MFD_CS42L43 dependency on REGMAP_IRQ
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmUeSFQACgkQEVvVyTe/
1WrKdxAAoMtgK5/GcNO3BijxtU1M7onu70Jm4froOFR0R2hdDFG76vfHrPHpavha
gnSOAOpDZYNkkJTKNmN2b047dRvrtgYiZ1rrSxlLjXY+2ojVoyAFXddVPNpT4nSy
2xka8KqLeU8Mtzt0GQq/PVe0JLgjbEEejn4wFdoilSBPZhf4sik24ReQVyxhYG3B
aILG83GZAuqokzO3HkhYaMo+qgm0BVLcHwZ0zd5UQ404srB7nmE4JxS4+sh0mrsc
LQmZdlXnzEj46nJ9pluWvUH2bGAA1+rtSg182mbcrwnXOlct40x0TI8MzzFcnmaj
mue37V+iLHLRr6BeIzNghbTflF0SeUH8Qbn7mjYkv4+VTfomYU0OzV3Wy54yUqA9
VrugqHAVpgv6r4gNPgkvNpp4YKTeFUd14FI4sO+Ny/f0CHCA+TeEP5g3IH5QguxF
vf23Uww9QyNgRp/JaV0kQaJfRqbMtUylMOiTo7Kaou+XV3A6AfnHhyyeS2qVmYI8
bsXjG4QzYrlxmvcFuVBXrLuiigVEI8DcTHSwXZ5NqpqIb/dCmXexZ48loL73f0Qj
Us8U1w7JymQDscug74F4S47CE3JoUH4hsu25StQ1LTvfIc+leQy1GBU3tVJv/d7U
cgD7OkktwjuQ5ClS+8gdP9fQ9AzI9/vGEhQL9+peKW6FK5EfWPc=
=r3VL
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs fixes from Amir Goldstein:
- Fix for file reference leak regression
- Fix for NULL pointer deref regression
- Fixes for RCU-walk race regressions:
Two of the fixes were taken from Al's RCU pathwalk race fixes series
with his consent [1].
Note that unlike most of Al's series, these two patches are not about
racing with ->kill_sb() and they are also very recent regressions
from v6.5, so I think it's worth getting them into v6.5.y.
There is also a fix for an RCU pathwalk race with ->kill_sb(), which
may have been solved in vfs generic code as you suggested, but it
also rids overlayfs from a nasty hack, so I think it's worth anyway.
Link: https://lore.kernel.org/linux-fsdevel/20231003204749.GA800259@ZenIV/ [1]
* tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: fix NULL pointer defer when encoding non-decodable lower fid
ovl: make use of ->layers safe in rcu pathwalk
ovl: fetch inode once in ovl_dentry_revalidate_common()
ovl: move freeing ovl_entry past rcu delay
ovl: fix file reference leak when submitting aio
Mat Martineau says:
====================
mptcp: Fixes and maintainer email update for v6.6
Patch 1 addresses a race condition in MPTCP "delegated actions"
infrastructure. Affects v5.19 and later.
Patch 2 removes an unnecessary restriction that did not allow additional
outgoing subflows using the local address of the initial MPTCP subflow.
v5.16 and later.
Patch 3 updates Matthieu's email address.
====================
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-0-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use my kernel.org account instead.
The other one will bounce by the end of the year.
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-3-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch drops id 0 limitation in mptcp_nl_cmd_sf_create() to allow
creating additional subflows with the local addr ID 0.
There is no reason not to allow additional subflows from this local
address: we should be able to create new subflows from the initial
endpoint. This limitation was breaking fullmesh support from userspace.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/391
Cc: stable@vger.kernel.org
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-2-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The delegated action infrastructure is prone to the following
race: different CPUs can try to schedule different delegated
actions on the same subflow at the same time.
Each of them will check different bits via mptcp_subflow_delegate(),
and will try to schedule the action on the related per-cpu napi
instance.
Depending on the timing, both can observe an empty delegated list
node, causing the same entry to be added simultaneously on two different
lists.
The root cause is that the delegated actions infra does not provide
a single synchronization point. Address the issue reserving an additional
bit to mark the subflow as scheduled for delegation. Acquiring such bit
guarantee the caller to own the delegated list node, and being able to
safely schedule the subflow.
Clear such bit only when the subflow scheduling is completed, ensuring
proper barrier in place.
Additionally swap the meaning of the delegated_action bitmask, to allow
the usage of the existing helper to set multiple bit at once.
Fixes: bcd9773431 ("mptcp: use delegate action to schedule 3rd ack retrans")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using netconsole netpoll_poll_dev could be called from interrupt
context, thus using disable_irq() would cause the following kernel
warning with CONFIG_DEBUG_ATOMIC_SLEEP enabled:
BUG: sleeping function called from invalid context at kernel/irq/manage.c:137
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 10, name: ksoftirqd/0
CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G W 5.15.42-00075-g816b502b2298-dirty #117
Hardware name: aml (r1) (DT)
Call trace:
dump_backtrace+0x0/0x270
show_stack+0x14/0x20
dump_stack_lvl+0x8c/0xac
dump_stack+0x18/0x30
___might_sleep+0x150/0x194
__might_sleep+0x64/0xbc
synchronize_irq+0x8c/0x150
disable_irq+0x2c/0x40
stmmac_poll_controller+0x140/0x1a0
netpoll_poll_dev+0x6c/0x220
netpoll_send_skb+0x308/0x390
netpoll_send_udp+0x418/0x760
write_msg+0x118/0x140 [netconsole]
console_unlock+0x404/0x500
vprintk_emit+0x118/0x250
dev_vprintk_emit+0x19c/0x1cc
dev_printk_emit+0x90/0xa8
__dev_printk+0x78/0x9c
_dev_warn+0xa4/0xbc
ath10k_warn+0xe8/0xf0 [ath10k_core]
ath10k_htt_txrx_compl_task+0x790/0x7fc [ath10k_core]
ath10k_pci_napi_poll+0x98/0x1f4 [ath10k_pci]
__napi_poll+0x58/0x1f4
net_rx_action+0x504/0x590
_stext+0x1b8/0x418
run_ksoftirqd+0x74/0xa4
smpboot_thread_fn+0x210/0x3c0
kthread+0x1fc/0x210
ret_from_fork+0x10/0x20
Since [0] .ndo_poll_controller is only needed if driver doesn't or
partially use NAPI. Because stmmac does so, stmmac_poll_controller
can be removed fixing the above warning.
[0] commit ac3d9dd034 ("netpoll: make ndo_poll_controller() optional")
Cc: <stable@vger.kernel.org> # 5.15.x
Fixes: 47dd7a540b ("net: add support for STMicroelectronics Ethernet controllers")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/1c156a6d8c9170bd6a17825f2277115525b4d50f.1696429960.git.repk@triplefau.lt
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While searching for possible refactor of napi_schedule_prep and
__napi_schedule it was notice that the mtk eth driver disable the
interrupt for rx and tx AFTER napi is scheduled.
While this is a very hard to repro case it might happen to have
situation where the interrupt is disabled and never enabled again as the
napi completes and the interrupt is enabled before.
This is caused by the fact that a napi driven by interrupt expect a
logic with:
1. interrupt received. napi prepared -> interrupt disabled -> napi
scheduled
2. napi triggered. ring cleared -> interrupt enabled -> wait for new
interrupt
To prevent this case, disable the interrupt BEFORE the napi is
scheduled.
Fixes: 656e705243 ("net-next: mediatek: add support for MT7623 ethernet")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20231002140805.568-1-ansuelsmth@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Handle the case when GSO SKB linear length is too large.
MANA NIC requires GSO packets to put only the header part to SGE0,
otherwise the TX queue may stop at the HW level.
So, use 2 SGEs for the skb linear part which contains more than the
packet header.
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove
the subtraction from header size.
Cc: stable@vger.kernel.org
Fixes: bd7fc6e195 ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For an unknown TX CQE error type (probably from a newer hardware),
still free the SKB, update the queue tail, etc., otherwise the
accounting will be wrong.
Also, TX errors can be triggered by injecting corrupted packets, so
replace the WARN_ONCE to ratelimited error logging.
Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Timerlat is reporting thread interference time without thread noise
events occurrence. It was caused because the thread interference variable
was not reset after the analysis of a timerlat activation that did not
hit the threshold.
- The IRQ handler delay is estimated from the delta of the IRQ latency
reported by timerlat, and the timestamp from IRQ handler start event.
If the delta is near-zero, the drift from the external clock and the
trace event and/or the overhead can cause the value to be negative.
If the value is negative, print a zero-delay.
- IRQ handlers happening after the timerlat thread event but before
the stop tracing were being reported as IRQ that happened before the
*current* IRQ occurrence. Ignore Previous IRQ noise in this condition
because they are valid only for the *next* timerlat activation.
Timerlat user-space:
- Timerlat is stopping all user-space thread if a CPU becomes
offline. Do not stop the entire tool if a CPU is/become offline,
but only the thread of the unavailable CPU. Stop the tool only,
if all threads leave because the CPUs become/are offline.
man-pages:
- Fix command line example in timerlat hist man page.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEElZdCZGILCpueJPrSY3Tw0sBuFwAFAmURVMQTHGJyaXN0b3RA
a2VybmVsLm9yZwAKCRBjdPDSwG4XAJezD/0fJnrzJFVSUwAXbdu1K679ik5iqwTk
UE/ZHY3dBbES6DFswXomofe4LkimY1tnLvyPr5tHqCGW8cvnMkOpgDK68LEgyL5a
1FLR8D+07i2dsEcsXfcAAF8iVEeF/SzOfHwZuY1ZJyicwl3xtya/QDrXpq8LZR1n
4YEWE3Xx60bo/Q81hTXN3uS+275bfuV/N8DSOXwVVWhK5kxheitc1ESUGLV/g1HQ
muyv+k+fH1qnOfkPsokhnxMjgzy7Tqv13onoVY+KUSQ1Ui58p+c3zQSkceWxM8c4
wnbfR0spF1eCoBlO2/PYUZ2p2zEh/NS3eTQchys4J2lbgURW1IIVaxaK1S5xC2CE
tkYkBOaUJXlD3HzTCkPRNpOI0+8Ydo0MDzzPUqjHemfFE7zzHVoZTfmdInSyddUz
ViKLi0HS+kjyvZVGa02JuDgPJmjTPgwd1F8p6cujHmSCbifbs4Oml9VaYHQRioZX
bkIDAX6NMkqDpb0baGjsIzbmiWnsIeo8J1IDqdXnD3VY1J78D+kBNCISxGjXuTSF
Eg3iyZJHWy2JhGBQ2k4lyCw9FZZ1FZtkURPWvTn5/PbsPqz5bjPWUcwXsyqE6wBL
OPR3HUcjgaMv7gJrErbsAaAGXxwpgTOe0qMcWI2tR7n6SHzniOn9WlDjegVnwWp1
r4ognHxasRQUAQ==
=1BAc
-----END PGP SIGNATURE-----
Merge tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux
Pull rtla fixes from Daniel Bristot de Oliveira:
"rtla (Real-Time Linux Analysis) tool fixes.
Timerlat auto-analysis:
- Timerlat is reporting thread interference time without thread noise
events occurrence. It was caused because the thread interference
variable was not reset after the analysis of a timerlat activation
that did not hit the threshold.
- The IRQ handler delay is estimated from the delta of the IRQ
latency reported by timerlat, and the timestamp from IRQ handler
start event. If the delta is near-zero, the drift from the external
clock and the trace event and/or the overhead can cause the value
to be negative. If the value is negative, print a zero-delay.
- IRQ handlers happening after the timerlat thread event but before
the stop tracing were being reported as IRQ that happened before
the *current* IRQ occurrence. Ignore Previous IRQ noise in this
condition because they are valid only for the *next* timerlat
activation.
Timerlat user-space:
- Timerlat is stopping all user-space thread if a CPU becomes
offline. Do not stop the entire tool if a CPU is/become offline,
but only the thread of the unavailable CPU. Stop the tool only, if
all threads leave because the CPUs become/are offline.
man-pages:
- Fix command line example in timerlat hist man page"
* tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux:
rtla: fix a example in rtla-timerlat-hist.rst
rtla/timerlat: Do not stop user-space if a cpu is offline
rtla/timerlat_aa: Fix previous IRQ delay for IRQs that happens after thread sample
rtla/timerlat_aa: Fix negative IRQ delay
rtla/timerlat_aa: Zero thread sum after every sample analysis
Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of hb timer. As the default value is 30s, users
have to wait up to 30s to wait its hb_interval update to work.
This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.
Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent(from the timer)
mentioned in:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.
In the collision scenario below:
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]
when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().
However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.
This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().
This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer() where it updates the transport state if there is
already a transport with the same ipaddr exists in the old asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.
The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:
(1) If an app reads data when > 1*MSS is unacknowledged, then
tcp_cleanup_rbuf() ACKs immediately because of:
tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
(2) If an app reads all received data, and the packets were < 1*MSS,
and either (a) the app is not ping-pong or (b) we received two
packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
of:
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
!inet_csk_in_pingpong_mode(sk))) &&
(3) *However*: if an app reads exactly 1*MSS of data,
tcp_cleanup_rbuf() does not send an immediate ACK. This is true
even if the app is not ping-pong and the 1*MSS of data had the PSH
bit set, suggesting the sending application completed an
application write.
Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.
The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.
When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.
And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.
The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
Fixes: fc6415bcb0 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUdcDQNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AOf7EACipnPx/532GUk1pECg+iWGTfhFOu1jdHjAILzy
+Ft/kfTLvd8kfZg6DuKIb6KYfj7w7uQ/xcD6wfqV8HBcss0SOyilx2ZUgH8ThwDv
tSIsUsx1M1gOGkXK713GrD6h/PR5BBv3vVFymvr+MliYH4C2mmsGOGWk5D+s8IqU
q3XDMMMlsZpfqCA8QGKK7TkFhnvnHdeoHGhZhw9ywXik733Qa4OUbJ5tkxztDKrr
DKF/FhpYxWPKHURtPXaQpWuni7xbMjg+3lHYlWTRZkQRQOoPWidBuTumqJxvwT3Z
FYwlS7T7OBMiFByy4spBnBs0uGiA6rR3sZ2/Gn98o9HpYlCllxpZm53Ay0u8sZTL
RBhMkacOXTWN5n1fbIqHIZc6vs7Tm1crvT2V/CseAuhe9TDiD5cHkaz7hJUQif6h
dmF48QHCHuSgWGtyPmVbTDSZ02YF++R398zHuBM2TXkFz8B9vI5DRpbXw0yX4ktg
LZSKnBALOPN5Ye27+W+itNfNaMC3+Elto3Cv9IvpTaXWl8WpF8hnNagLObEXxJ1Q
3dLRKpSHDKJe7BLQoqm9ESFUE80bZr+S1Xleukz0z7AamCrM/rxQGKBwbTJs2NoE
1YezWzhw68+aQ7BY8eWigDAQKmtn1Oju3v5u5IekGKQVvXd5x97VGlJQRVxQvr2Z
jDHNFw==
=YLDi
-----END PGP SIGNATURE-----
Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter patches for net
First patch resolves a regression with vlan header matching, this was
broken since 6.5 release. From myself.
Second patch fixes an ancient problem with sctp connection tracking in
case INIT_ACK packets are delayed. This comes with a selftest, both
patches from Xin Long.
Patch 4 extends the existing nftables audit selftest, from
Phil Sutter.
Patch 5, also from Phil, avoids a situation where nftables
would emit an audit record twice. This was broken since 5.13 days.
Patch 6, from myself, avoids spurious insertion failure if we encounter an
overlapping but expired range during element insertion with the
'nft_set_rbtree' backend. This problem exists since 6.2.
* tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
selftests: netfilter: Extend nft_audit.sh
selftests: netfilter: test for sctp collision processing in nf_conntrack
netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
netfilter: nft_payload: rebuild vlan header on h_proto access
====================
Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>