Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev selftest callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As all users of the struct devlink_info_req are already in dev.c, move
this struct from devl_internal.c to be local in dev.c.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev flash callbacks, helpers and other related code from
leftover.c to dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev info callbacks, related drivers helpers functions and
other related code from leftover.c to dev.c. No functional change in
this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev eswitch callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev reload callback and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev get and dump callbacks and related dev code to new file
dev.c. This file shall include all callbacks that are specific on
devlink dev object.
No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that commit 028fb19c6b ("netlink: provide an ability to set
default extack message") provides a weak function that doesn't override
an existing extack message provided by the driver, it makes sense to use
it also for LAG and HSR offloading, not just for bridge offloading.
Also consistently put the message string on a separate line, to reduce
line length from 92 to 84 characters.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230202140354.3158129-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both synchronous early drop algorithm and asynchronous gc worker completely
ignore connections with IPS_OFFLOAD_BIT status bit set. With new
functionality that enabled UDP NEW connection offload in action CT
malicious user can flood the conntrack table with offloaded UDP connections
by just sending a single packet per 5tuple because such connections can no
longer be deleted by early drop algorithm.
To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify the offload algorithm of UDP connections to the following:
- Offload NEW connection as unidirectional.
- When connection state changes to ESTABLISHED also update the hardware
flow. However, in order to prevent act_ct from spamming offload add wq for
every packet coming in reply direction in this state verify whether
connection has already been updated to ESTABLISHED in the drivers. If that
it the case, then skip flow_table and let conntrack handle such packets
which will also allow conntrack to potentially promote the connection to
ASSURED.
- When connection state changes to ASSURED set the flow_table flow
NF_FLOW_HW_BIDIRECTIONAL flag which will cause refresh mechanism to offload
the reply direction.
All other protocols have their offload algorithm preserved and are always
offloaded as bidirectional.
Note that this change tries to minimize the load on flow_table add
workqueue. First, it tracks the last ctinfo that was offloaded by using new
flow 'NF_FLOW_HW_ESTABLISHED' flag and doesn't schedule the refresh for
reply direction packets when the offloads have already been updated with
current ctinfo. Second, when 'add' task executes on workqueue it always
update the offload with current flow state (by checking 'bidirectional'
flow flag and obtaining actual ctinfo/cookie through meta action instead of
caching any of these from the moment of scheduling the 'add' work)
preventing the need from scheduling more updates if state changed
concurrently while the 'add' work was pending on workqueue.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tcf_ct_flow_table_fill_actions() function assumes that only
established connections can be offloaded and always sets ctinfo to either
IP_CT_ESTABLISHED or IP_CT_ESTABLISHED_REPLY strictly based on direction
without checking actual connection state. To enable UDP NEW connection
offload set the ctinfo, metadata cookie and NF_FLOW_HW_ESTABLISHED
flow_offload flags bit based on ct->status value.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify flow table offload to cache the last ct info status that was passed
to the driver offload callbacks by extending enum nf_flow_flags with new
"NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established'
during last act_ct meta actions fill call. This infrastructure change is
necessary to optimize promoting of UDP connections from 'new' to
'established' in following patches in this series.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify flow table offload to support unidirectional connections by
extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only
offload reply direction when the flag is set. This infrastructure change is
necessary to support offloading UDP NEW connections in original direction
in following patches in series.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently flow_offload_fixup_ct() function assumes that only replied UDP
connections can be offloaded and hardcodes UDP_CT_REPLIED timeout value. To
enable UDP NEW connection offload in following patches extract the actual
connections state from ct->status and set the timeout according to it.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A summary of the flags being set for various drivers is given below.
Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features
that can be turned off and on at runtime. This means that these flags
may be set and unset under RTNL lock protection by the driver. Hence,
READ_ONCE must be used by code loading the flag value.
Also, these flags are not used for synchronization against the availability
of XDP resources on a device. It is merely a hint, and hence the read
may race with the actual teardown of XDP resources on the device. This
may change in the future, e.g. operations taking a reference on the XDP
resources of the driver, and in turn inhibiting turning off this flag.
However, for now, it can only be used as a hint to check whether device
supports becoming a redirection target.
Turn 'hw-offload' feature flag on for:
- netronome (nfp)
- netdevsim.
Turn 'native' and 'zerocopy' features flags on for:
- intel (i40e, ice, ixgbe, igc)
- mellanox (mlx5).
- stmmac
- netronome (nfp)
Turn 'native' features flags on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2, enetc)
- funeth
- intel (igb)
- marvell (mvneta, mvpp2, octeontx2)
- mellanox (mlx4)
- mtk_eth_soc
- qlogic (qede)
- sfc
- socionext (netsec)
- ti (cpsw)
- tap
- tsnep
- veth
- xen
- virtio_net.
Turn 'basic' (tx, pass, aborted and drop) features flags on for:
- netronome (nfp)
- cavium (thunder)
- hyperv.
Turn 'redirect_target' feature flag on for:
- amanzon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2)
- intel (i40e, ice, igb, ixgbe)
- ti (cpsw)
- marvell (mvneta, mvpp2)
- sfc
- socionext (netsec)
- qlogic (qede)
- mellanox (mlx5)
- tap
- veth
- virtio_net
- xen
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a Netlink spec-compatible family for netdevs.
This is a very simple implementation without much
thought going into it.
It allows us to reap all the benefits of Netlink specs,
one can use the generic client to issue the commands:
$ ./cli.py --spec netdev.yaml --dump dev_get
[{'ifindex': 1, 'xdp-features': set()},
{'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
{'ifindex': 3, 'xdp-features': {'rx-sg'}}]
the generic python library does not have flags-by-name
support, yet, but we also don't have to carry strings
in the messages, as user space can get the names from
the spec.
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEDs2BvajyNKlf9TJQvlAcSiqKBOgFAmPbg4YTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRC+UBxKKooE6AuHB/46pXdhKmIwJlyioOUM7L3TBOHN5WDR
PqOo8l1VoiU2mjL934+QwrKd45WKyuzoMudaNzzzoB2hF+PTCzvoQS13n/iSEbxk
DhEHrkVeBi7fuQObgcCyYXXDq/igiGawlPRtuIej9MUEtfzjO61BvMgo6HVaxNSi
Q8uZxjDcUnkr3qBRQPPPgcdJmUbYzLUMoWg6lAlab5alHjFAuoQpYuiunEE+xFDU
PN9gofkrdXDcy0aoFPPdLs1AIF7MlmGAFeh9DEkpBF4gHQVGulOW8yEcQFBLekNN
hOz3chd2eTdDZvpIiIIHQursA25nWK0X9PqAarUENMz0jhG6WSJIyEDe
=qIgG
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-6.2-20230202' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
can 2023-02-02
The first patch is by Ziyang Xuan and removes a errant WARN_ON_ONCE()
in the CAN J1939 protocol.
The next 3 patches are by Oliver Hartkopp. The first 2 target the CAN
ISO-TP protocol and fix the state machine with respect to signals and
a regression found by the syzbot.
The last patch is by me an missing assignment during the ethtool ring
configuration callback.
* tag 'linux-can-fixes-for-6.2-20230202' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: mcp251xfd: mcp251xfd_ring_set_ringparam(): assign missing tx_obj_num_coalesce_irq
can: isotp: split tx timer into transmission and timeout
can: isotp: handle wait_event_interruptible() return values
can: raw: fix CAN FD frame transmissions over CAN XL devices
can: j1939: fix errant WARN_ON_ONCE in j1939_session_deactivate
====================
Link: https://lore.kernel.org/r/20230202094135.2293939-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To send CAN traffic back to the incoming interface a special flag has to
be set. When creating a routing job for identical interfaces without this
flag the rule is created but has no effect.
This patch adds an error return value in the case that the CAN interfaces
are identical but the CGW_FLAGS_CAN_IIF_TX_OK flag was not set.
Reported-by: Jannik Hartung <jannik.hartung@tu-bs.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230125055407.2053-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Link: https://lore.kernel.org/r/20230201081438.3151-1-liubo03@inspur.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Remove the check for a negative number of keys as
this cannot ever happen
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The software pedit action didn't get the same love as some of the
other actions and it's still using spinlocks and shared stats in the
datapath.
Transition the action to rcu and percpu stats as this improves the
action's performance dramatically on multiple cpu deployments.
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmPZRFAACgkQ+7dXa6fL
C2sf2w/+JPdtpughFCFQEztVYrw6pDr3EU4iDGrbOW1GDGHppTP+l6SaGXIG3uLD
ahd/3PrTCDNwSEu7acY7RADKtse8DT2nzEutoOStTSFISpD49d/K1h7jWfvCaDO8
p18rmRPnn/MTHmWOBazivBuKiA7W8tiCm+CSIEXldfuBZ1cHTY2qjR/MdSjEnMQX
T2cuCONkb7yuT0K8BCFm2Oj3LPi7EoMorie18dY9zdEsBK22unqssQ/hWzbYHxX9
JC2Xqx8G76xCDKymejpf75V3NrO/OYjr0X8JItb77uc6Fc7i944cK5WM+1ikPnox
kFfPauT/vIOZFBm0bXxzVN7x4VOtfO6wc8VNNtuCYKKJFwqwbbT82bLXfVGwEFNf
ZC7KeU6y3dN2iEqcPpytYIocIUE3SkYwEk8xBWGK3Ufm4MK06QqqcFowUPJbDAnQ
9RXdO6JPdswUmDtrgaqtx9K/+BWH7eRFtmbySG2+ZTIVEfthFPPr0ZGtFSGukdBr
0VGF5oUywAyW36pPGgTosOysC+wZlDdKKKlVzhWxbzhvF/+U+QGW7W0BZH6aqN8C
ZslReMY4xauzluAMdyyX/ZewBCjdlX2wkOPIgieZPO0LL+ptB+gpCXXwmzVzO6ln
2i9+se4MOiWMSVENdvYRVN5tcZwf9u/r8asX4QsqtRBhjdNI5n8=
=s0Uq
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
Here's the fifth part of patches in the process of moving rxrpc from doing
a lot of its stuff in softirq context to doing it in an I/O thread in
process context and thereby making it easier to support a larger SACK
table.
The full description is in the description for the first part[1] which is
now upstream. The second and third parts are also upstream[2]. A subset
of the original fourth part[3] got applied as a fix for a race[4].
The fifth part includes some cleanups:
(1) Miscellaneous trace header cleanups: fix a trace string, display the
security index in rx_packet rather than displaying the type twice,
remove some whitespace to make checkpatch happier and remove some
excess tabulation.
(2) Convert ->recvmsg_lock to a spinlock as it's only ever locked
exclusively.
(3) Make ->ackr_window and ->ackr_nr_unacked non-atomic as they're only
used in the I/O thread.
(4) Don't use call->tx_lock to access ->tx_buffer as that is only accessed
inside the I/O thread. sendmsg() loads onto ->tx_sendmsg and the I/O
thread decants from that to the buffer.
(5) Remove local->defrag_sem as DATA packets are transmitted serially by
the I/O thread.
(6) Remove the service connection bundle is it was only used for its
channel_lock - which has now gone.
And some more significant changes:
(7) Add a debugging option to allow a delay to be injected into packet
reception to help investigate the behaviour over longer links than
just a few cm.
(8) Generate occasional PING ACKs to probe for RTT information during a
receive heavy call.
(9) Simplify the SACK table maintenance and ACK generation. Now that both
parts are done in the same thread, there's no possibility of a race
and no need to try and be cunning to avoid taking a BH spinlock whilst
copying the SACK table (which in the future will be up to 2K) and no
need to rotate the copy to fit the ACK packet table.
(10) Use SKB_CONSUMED when freeing received DATA packets (stop dropwatch
complaining).
* tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
rxrpc: Kill service bundle
rxrpc: Change rx_packet tracepoint to display securityIndex not type twice
rxrpc: Show consumed and freed packets as non-dropped in dropwatch
rxrpc: Remove local->defrag_sem
rxrpc: Don't lock call->tx_lock to access call->tx_buffer
rxrpc: Simplify ACK handling
rxrpc: De-atomic call->ackr_window and call->ackr_nr_unacked
rxrpc: Generate extra pings for RTT during heavy-receive call
rxrpc: Allow a delay to be injected into packet reception
rxrpc: Convert call->recvmsg_lock to a spinlock
rxrpc: Shrink the tabulation in the rxrpc trace header a bit
rxrpc: Remove whitespace before ')' in trace header
rxrpc: Fix trace string
====================
Link: https://lore.kernel.org/all/20230131171227.3912130-1-dhowells@redhat.com/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The timer for the transmission of isotp PDUs formerly had two functions:
1. send two consecutive frames with a given time gap
2. monitor the timeouts for flow control frames and the echo frames
This led to larger txstate checks and potentially to a problem discovered
by syzbot which enabled the panic_on_warn feature while testing.
The former 'txtimer' function is split into 'txfrtimer' and 'txtimer'
to handle the two above functionalities with separate timer callbacks.
The two simplified timers now run in one-shot mode and make the state
transitions (especially with isotp_rcv_echo) better understandable.
Fixes: 866337865f ("can: isotp: fix tx state handling for echo tx processing")
Reported-by: syzbot+5aed6c3aaba661f5b917@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # >= v6.0
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230104145701.2422-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When wait_event_interruptible() has been interrupted by a signal the
tx.state value might not be ISOTP_IDLE. Force the state machines
into idle state to inhibit the timer handlers to continue working.
Fixes: 866337865f ("can: isotp: fix tx state handling for echo tx processing")
Cc: stable@vger.kernel.org
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230112192347.1944-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
A CAN XL device is always capable to process CAN FD frames. The former
check when sending CAN FD frames relied on the existence of a CAN FD
device and did not check for a CAN XL device that would be correct
too.
With this patch the CAN FD feature is enabled automatically when CAN
XL is switched on - and CAN FD cannot be switch off while CAN XL is
enabled.
This precondition also leads to a clean up and reduction of checks in
the hot path in raw_rcv() and raw_sendmsg(). Some conditions are
reordered to handle simple checks first.
changes since v1: https://lore.kernel.org/all/20230131091012.50553-1-socketcan@hartkopp.net
- fixed typo: devive -> device
changes since v2: https://lore.kernel.org/all/20230131091824.51026-1-socketcan@hartkopp.net/
- reorder checks in if statements to handle simple checks first
Fixes: 626332696d ("can: raw: add CAN XL support")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230131105613.55228-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The conclusion "j1939_session_deactivate() should be called with a
session ref-count of at least 2" is incorrect. In some concurrent
scenarios, j1939_session_deactivate can be called with the session
ref-count less than 2. But there is not any problem because it
will check the session active state before session putting in
j1939_session_deactivate_locked().
Here is the concurrent scenario of the problem reported by syzbot
and my reproduction log.
cpu0 cpu1
j1939_xtp_rx_eoma
j1939_xtp_rx_abort_one
j1939_session_get_by_addr [kref == 2]
j1939_session_get_by_addr [kref == 3]
j1939_session_deactivate [kref == 2]
j1939_session_put [kref == 1]
j1939_session_completed
j1939_session_deactivate
WARN_ON_ONCE(kref < 2)
=====================================================
WARNING: CPU: 1 PID: 21 at net/can/j1939/transport.c:1088 j1939_session_deactivate+0x5f/0x70
CPU: 1 PID: 21 Comm: ksoftirqd/1 Not tainted 5.14.0-rc7+ #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:j1939_session_deactivate+0x5f/0x70
Call Trace:
j1939_session_deactivate_activate_next+0x11/0x28
j1939_xtp_rx_eoma+0x12a/0x180
j1939_tp_recv+0x4a2/0x510
j1939_can_recv+0x226/0x380
can_rcv_filter+0xf8/0x220
can_receive+0x102/0x220
? process_backlog+0xf0/0x2c0
can_rcv+0x53/0xf0
__netif_receive_skb_one_core+0x67/0x90
? process_backlog+0x97/0x2c0
__netif_receive_skb+0x22/0x80
Fixes: 0c71437dd5 ("can: j1939: j1939_session_deactivate(): clarify lifetime of session object")
Reported-by: syzbot+9981a614060dcee6eeca@syzkaller.appspotmail.com
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20210906094200.95868-1-william.xuanziyang@huawei.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
In netdev common pattern, extack pointer is forwarded to the drivers
to be filled with error message. However, the caller can easily
overwrite the filled message.
Instead of adding multiple "if (!extack->_msg)" checks before any
NL_SET_ERR_MSG() call, which appears after call to the driver, let's
add new macro to common code.
[1] https://lore.kernel.org/all/Y9Irgrgf3uxOjwUm@unreal
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/6993fac557a40a1973dfa0095107c3d03d40bec1.1675171790.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected, it instead causes a random delay between
[0, U32_MAX). Looking at this comment from
__get_random_u32_below() explains the reason:
/*
* This function is technically undefined for ceil == 0, and in fact
* for the non-underscored constant version in the header, we build bug
* on that. But for the non-constant case, it's convenient to have that
* evaluate to being a straight call to get_random_u32(), so that
* get_random_u32_inclusive() can work over its whole range without
* undefined behavior.
*/
Added helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.
Also added definition of proxy_delay to ip-sysctl.txt since
it was missing.
Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.
Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
in __ip_local_out() to allow to send BIG TCP packets, and this implies
that skb->len is the length of a IPv4 packet; On RX path, use skb->len
as the length of the IPv4 packet when the IP header tot_len is 0 and
skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
need to update these APIs.
Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
GRO complete, set IP header tot_len to 0 when the merged packet size
greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
on RX path.
Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
packets.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
per device and adds netlink attributes for them, so that IPV4
BIG TCP can be guarded by a separate tunable in the next patch.
To not break the old application using "gso/gro_max_size" for
IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
in netif_set_gso/gro_max_size() if the new size isn't greater
than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
userspace doesn't realize the new netlink attributes.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user
that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in
tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this
flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
the iph->tot_len update should use iph_set_totlen().
Note that for these non GSO packets, the new iph tot_len with extra
iph option len added may become greater than 65535, the old process
will cast it and set iph->tot_len to it, which is a bug. In theory,
iph options shouldn't be added for these big packets in here, a fix
may be needed here in the future. For now this patch is only to set
iph->tot_len to 0 when it happens.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are also quite some places in netfilter that may process IPv4 TCP
GSO packets, we need to replace them too.
In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen()
return value, otherwise it may overflow and mismatch. This change will
also help us add selftest for IPv4 BIG TCP in the following patch.
Note that we don't need to replace the one in tcpmss_tg4(), as it will
return if there is data after tcphdr in tcpmss_mangle_packet(). The
same in mangle_contents() in nf_nat_helper.c, it returns false when
skb->len + extra > 65535 in enlarge_skb().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are 1 action and 1 qdisc that may process IPv4 TCP GSO packets
and access iph->tot_len, replace them with skb_ip_totlen() and
iph_totlen() accordingly.
Note that we don't need to replace the one in tcf_csum_ipv4(), as it
will return for TCP GSO packets in tcf_csum_ipv4_tcp().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 GSO packets may get processed in ovs_skb_network_trim(),
and we need to use skb_ip_totlen() to get iph totlen.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These 3 places in bridge netfilter are called on RX path after GRO
and IPv4 TCP GSO packets may come through, so replace iph tot_len
accessing with skb_ip_totlen() in there.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Swap is a function interface that provides exchange function. To avoid
code duplication, we can use swap function.
./net/ipv6/icmp.c:344:25-26: WARNING opportunity for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3896
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230131063456.76302-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We recently found that our non-point-to-point tunnels were not
generating any IPv6 link local address and instead generating an
IPv6 compat address, breaking IPv6 communication on the tunnel.
Previously, addrconf_gre_config always would call addrconf_addr_gen
and generate a EUI64 link local address for the tunnel.
Then commit e5dd729460 changed the code path so that add_v4_addrs
is called but this only generates a compat IPv6 address for
non-point-to-point tunnels.
I assume the compat address is specifically for SIT tunnels so
have kept that only for SIT - GRE tunnels now always generate link
local addresses.
Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For our point-to-point GRE tunnels, they have IN6_ADDR_GEN_MODE_NONE
when they are created then we set IN6_ADDR_GEN_MODE_EUI64 when they
come up to generate the IPv6 link local address for the interface.
Recently we found that they were no longer generating IPv6 addresses.
This issue would also have affected SIT tunnels.
Commit e5dd729460 changed the code path so that GRE tunnels
generate an IPv6 address based on the tunnel source address.
It also changed the code path so GRE tunnels don't call addrconf_addr_gen
in addrconf_dev_config which is called by addrconf_sysctl_addr_gen_mode
when the IN6_ADDR_GEN_MODE is changed.
This patch aims to fix this issue by moving the code in addrconf_notify
which calls the addr gen for GRE and SIT into a separate function
and calling it in the places that expect the IPv6 address to be
generated.
The previous addrconf_dev_config is renamed to addrconf_eth_config
since it only expected eth type interfaces and follows the
addrconf_gre/sit_config format.
A part of this changes means that the loopback address will be
attempted to be configured when changing addr_gen_mode for lo.
This should not be a problem because the address should exist anyway
and if does already exist then no error is produced.
Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kfuncs are allowed to be static, or not use one or more of their
arguments. For example, bpf_xdp_metadata_rx_hash() in net/core/xdp.c is
meant to be implemented by drivers, with the default implementation just
returning -EOPNOTSUPP. As described in [0], such kfuncs can have their
arguments elided, which can cause BTF encoding to be skipped. The new
__bpf_kfunc macro should address this, and this patch adds a selftest
which verifies that a static kfunc with at least one unused argument can
still be encoded and invoked by a BPF program.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230201173016.342758-5-void@manifault.com
Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
In order to maintain naming consistency, rename and reorder all usages
of struct struct devlink_cmd in the following way:
1) Remove "gen" and replace it with "cmd" to match the struct name
2) Order devl_cmds[] and the header file to match the order
of enum devlink_command
3) Move devl_cmd_rate_get among the peers
4) Remove "inst" for DEVLINK_CMD_GET
5) Add "_get" suffix to all to match DEVLINK_CMD_*_GET (only rate had it
done correctly)
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No need to have "gen" inside name of the structure for devlink commands.
Remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To have the name of the function consistent with the struct cb name,
rename devlink_nl_instance_iter_dump() to
devlink_nl_instance_iter_dumpit().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Release bridge info once packet escapes the br_netfilter path,
from Florian Westphal.
2) Revert incorrect fix for the SCTP connection tracking chunk
iterator, also from Florian.
First path fixes a long standing issue, the second path addresses
a mistake in the previous pull request for net.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
Revert "netfilter: conntrack: fix bug in for_each_sctp_chunk"
netfilter: br_netfilter: disable sabotage_in hook after first suppression
====================
Link: https://lore.kernel.org/r/20230131133158.4052-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It tries to avoid the frequently hb_timer refresh in commit ba6f5e33bd
("sctp: avoid refreshing heartbeat timer too often"), and it only allows
mod_timer when the new expires is after hb_timer.expires. It means even
a much shorter interval for hb timer gets applied, it will have to wait
until the current hb timer to time out.
In sctp_do_8_2_transport_strike(), when a transport enters PF state, it
expects to update the hb timer to resend a heartbeat every rto after
calling sctp_transport_reset_hb_timer(), which will not work as the
change mentioned above.
The frequently hb_timer refresh was caused by sctp_transport_reset_timers()
called in sctp_outq_flush() and it was already removed in the commit above.
So we don't have to check hb_timer.expires when resetting hb_timer as it is
now not called very often.
Fixes: ba6f5e33bd ("sctp: avoid refreshing heartbeat timer too often")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/d958c06985713ec84049a2d5664879802710179a.1675095933.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the bundle->channel_lock has been eliminated, we don't need the
dummy service bundle anymore. It's purpose was purely to provide the
channel_lock for service connections.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Set a reason when freeing a packet that has been consumed such that
dropwatch doesn't complain that it has been dropped.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
We no longer need local->defrag_sem as all DATA packet transmission is now
done from one thread, so remove it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
call->tx_buffer is now only accessed within the I/O thread (->tx_sendmsg is
the way sendmsg passes packets to the I/O thread) so there's no need to
lock around it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Now that general ACK transmission is done from the same thread as incoming
DATA packet wrangling, there's no possibility that the SACK table will be
being updated by the latter whilst the former is trying to copy it to an
ACK.
This means that we can safely rotate the SACK table whilst updating it
without having to take a lock, rather than keeping all the bits inside it
in fixed place and copying and then rotating it in the transmitter.
Therefore, simplify SACK handing by keeping track of starting point in the
ring and rotate slots down as we consume them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
call->ackr_window doesn't need to be atomic as ACK generation and ACK
transmission are now done in the same thread, so drop the atomic64 handling
and split it into two separate members.
Similarly, call->ackr_nr_unacked doesn't need to be atomic now either.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When doing a call that has a single transmitted data packet and a massive
amount of received data packets, we only ping for one RTT sample, which
means we don't get a good reading on it.
Fix this by converting occasional IDLE ACKs into PING ACKs to elicit a
response.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If CONFIG_AF_RXRPC_DEBUG_RX_DELAY=y, then a delay is injected between
packets and errors being received and them being made available to the
processing code, thereby allowing the RTT to be artificially increased.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Convert call->recvmsg_lock to a spinlock as it's only ever write-locked.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
There is no bug. If sch->length == 0, this would result in an infinite
loop, but first caller, do_basic_checks(), errors out in this case.
After this change, packets with bogus zero-length chunks are no longer
detected as invalid, so revert & add comment wrt. 0 length check.
Fixes: 98ee007745 ("netfilter: conntrack: fix bug in for_each_sctp_chunk")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using a xfrm interface in a bridged setup (the outgoing device is
bridged), the incoming packets in the xfrm interface are only tracked
in the outgoing direction.
$ brctl show
bridge name interfaces
br_eth1 eth1
$ conntrack -L
tcp 115 SYN_SENT src=192... dst=192... [UNREPLIED] ...
If br_netfilter is enabled, the first (encrypted) packet is received onR
eth1, conntrack hooks are called from br_netfilter emulation which
allocates nf_bridge info for this skb.
If the packet is for local machine, skb gets passed up the ip stack.
The skb passes through ip prerouting a second time. br_netfilter
ip_sabotage_in supresses the re-invocation of the hooks.
After this, skb gets decrypted in xfrm layer and appears in
network stack a second time (after decryption).
Then, ip_sabotage_in is called again and suppresses netfilter
hook invocation, even though the bridge layer never called them
for the plaintext incarnation of the packet.
Free the bridge info after the first suppression to avoid this.
I was unable to figure out where the regression comes from, as far as i
can see br_netfilter always had this problem; i did not expect that skb
is looped again with different headers.
Fixes: c4b0e771f9 ("netfilter: avoid using skb->nf_bridge directly")
Reported-and-tested-by: Wolfgang Nothdurft <wolfgang@linogate.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Nothing was explicitly bounds checking the priority index used to access
clpriop[]. WARN and bail out early if it's pathological. Seen with GCC 13:
../net/sched/sch_htb.c: In function 'htb_activate_prios':
../net/sched/sch_htb.c:437:44: warning: array subscript [0, 31] is outside array bounds of 'struct htb_prio[8]' [-Warray-bounds=]
437 | if (p->inner.clprio[prio].feed.rb_node)
| ~~~~~~~~~~~~~~~^~~~~~
../net/sched/sch_htb.c:131:41: note: while referencing 'clprio'
131 | struct htb_prio clprio[TC_HTB_NUMPRIO];
| ^~~~~~
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20230127224036.never.561-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stefan Schmidt says:
====================
ieee802154 for net 2023-01-30
Only one fix this time around.
Miquel Raynal fixed a potential double free spotted by Dan Carpenter.
* tag 'ieee802154-for-net-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
mac802154: Fix possible double free upon parsing error
====================
Link: https://lore.kernel.org/r/20230130095646.301448-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_is_tx_ready() checks that list_first_entry() does not return NULL.
This condition can never happen. For empty lists, list_first_entry()
returns the list_entry() of the head, which is a type confusion.
Use list_first_entry_or_null() which returns NULL in case of empty
lists.
Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Link: https://lore.kernel.org/r/20230128-list-entry-null-check-tls-v1-1-525bbfe6f0d0@diag.uniroma1.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a trace string to indicate that it's discarding the local endpoint for
a preallocated peer, not a preallocated connection.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When copying the DSCP bits for decap-dscp into IPv6 don't assume the
outer encap is always IPv6. Instead, as with the inner IPv4 case, copy
the DSCP bits from the correctly saved "tos" value in the control block.
Fixes: 227620e295 ("[IPSEC]: Separate inner/outer mode processing on input")
Signed-off-by: Christian Hopps <chopps@chopps.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Devlink features were introduced to disallow devlink reload calls of
userspace before the devlink was fully initialized. The reason for this
workaround was the fact that devlink reload was originally called
without devlink instance lock held.
However, with recent changes that converted devlink reload to be
performed under devlink instance lock, this is redundant so remove
devlink features entirely.
Note that mlx5 used this to enable devlink reload conditionally only
when device didn't act as multi port slave. Move the multi port check
into mlx5_devlink_reload_down() callback alongside with the other
checks preventing the device from reload in certain states.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the notifications are only sent for params. People who
introduced other objects forgot to add the reload notifications here.
To make sure all notifications happen according to existing comment,
benefit from existence of devlink_notify_register/unregister() helpers
and use them in reload code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This effectively reverts commit 05a7f4a8df ("devlink: Break parameter
notification sequence to be before/after unload/load driver").
Cited commit resolved a problem in mlx5 params implementation,
when param notification code accessed memory previously freed
during reload.
Now, when the params can be registered and unregistered when devlink
instance is registered, mlx5 code unregisters the problematic param
during devlink reload. The fix is therefore no longer needed.
Current behavior is a it problematic, as it sends DEL notifications even
in potential case when reload_down() call fails which might confuse
userspace notifications listener.
So move the reload notifications back where they were originally in
between reload_down() and reload_up() calls.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- drop prandom.h includes, by Sven Eckelmann
- fix mailing list address, by Sven Eckelmann
- multicast feature preparation, by Linus Lüssing (2 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmPTpRsWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZWoEAC7F92qmyNAye5ylC4O7XeG5KXL
tPHzOSp5n5Q98HOmLGrqgOTEgfoX8q/GkAQh4lauiG5LPhs8iOYfHhiz5u0rXXnv
Dm26PCLTKxOd+nif2H28rh4ahzgZUSizQgtg3S6tiYAvvTc8TM67UbJpPkU3sXf6
Xh9gT7W5YvZ8lvEQF/l/otdhuE3jpiBgjzyfDKTyLlVweWgli4XZSZ9BqIhv8c/O
gSa6vp72FGyJEZJCHyXh4jD1yNU/S+CRFpj6Gy1OsqLVMnNVlJviwEyVJ+ZuzSw2
KSOeeUXxN4XjeDulIOPgvvxFRsAPCeNoGqOzHtlLgEugyZGowKY9q5Bh33K63A7t
YzayKnfq2CKAVg9VujePurGkiMTY5gjrtuS7KlcWxXn30V8rry3bN6Y8XLphr80A
p1fVYjPQtFhbjjrrmsKbGSCmJiNo+jx2Rs6y1aQ86ykNVK3Et6rSUrISaMRjZEEG
OIMGOkzE0ESH2yK99sdaHOKiauZIMPMHTK7rB53cSMCL2eyP+6wP1jGQFwiytaA6
9GzMD8w8LVrJQc7PjkYP6f75iFKxtWzpf8RzwePhZFiRGVBROW0+6ECv6NAx2ImG
PoMYcqy0EV3ffwUhNnWtML+f4keqBaOKXa63GyWGBKymHLr5yLBzyvjgsr9jMJNb
3Sj/qf7JQy908jSxQQ==
=yGYm
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20230127' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- drop prandom.h includes, by Sven Eckelmann
- fix mailing list address, by Sven Eckelmann
- multicast feature preparation, by Linus Lüssing (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
s390x ABI requires the caller to zero- or sign-extend the arguments.
eBPF already deals with zero-extension (by definition of its ABI), but
not with sign-extension.
Add a test to cover that potentially problematic area.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-15-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We may have pending skbs in the receive queue when the sk is being
destroyed; add a destructor to purge the queue.
MCTP doesn't use the error queue, so only the receive_queue is purged.
Fixes: 833ef3b91d ("mctp: Populate socket implementation")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230126064551.464468-1-jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Function radix_tree_insert() returns errors if the node hasn't
been initialized and added to the tree.
"kfree(node)" and return value "NULL" of node_get() help
to avoid using unclear node in other calls.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: <stable@vger.kernel.org> # 5.7
Fixes: 0c2204a4ad ("net: qrtr: Migrate nameservice to kernel from userspace")
Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://lore.kernel.org/r/20230125134831.8090-1-n.petrova@fintech.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If you call listen() and accept() on an already connect()ed
rose socket, accept() can successfully connect.
This is because when the peer socket sends data to sendmsg,
the skb with its own sk stored in the connected socket's
sk->sk_receive_queue is connected, and rose_accept() dequeues
the skb waiting in the sk->sk_receive_queue.
This creates a child socket with the sk of the parent
rose socket, which can cause confusion.
Fix rose_listen() to return -EINVAL if the socket has
already been successfully connected, and add lock_sock
to prevent this issue.
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230125105944.GA133314@ubuntu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
=yYqA
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2023-01-28
We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).
The main changes are:
1) Implement XDP hints via kfuncs with initial support for RX hash and
timestamp metadata kfuncs, from Stanislav Fomichev and
Toke Høiland-Jørgensen.
Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk
2) Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.
3) Significantly reduce the search time for module symbols by livepatch
and BPF, from Jiri Olsa and Zhen Lei.
4) Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs
in different time intervals, from David Vernet.
5) Fix several issues in the dynptr processing such as stack slot liveness
propagation, missing checks for PTR_TO_STACK variable offset, etc,
from Kumar Kartikeya Dwivedi.
6) Various performance improvements, fixes, and introduction of more
than just one XDP program to XSK selftests, from Magnus Karlsson.
7) Big batch to BPF samples to reduce deprecated functionality,
from Daniel T. Lee.
8) Enable struct_ops programs to be sleepable in verifier,
from David Vernet.
9) Reduce pr_warn() noise on BTF mismatches when they are expected under
the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.
10) Describe modulo and division by zero behavior of the BPF runtime
in BPF's instruction specification document, from Dave Thaler.
11) Several improvements to libbpf API documentation in libbpf.h,
from Grant Seltzer.
12) Improve resolve_btfids header dependencies related to subcmd and add
proper support for HOSTCC, from Ian Rogers.
13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
helper along with BPF selftests, from Ziyang Xuan.
14) Simplify the parsing logic of structure parameters for BPF trampoline
in the x86-64 JIT compiler, from Pu Lehui.
15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
Rust compilation units with pahole, from Martin Rodriguez Reboredo.
16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
from Kui-Feng Lee.
17) Disable stack protection for BPF objects in bpftool given BPF backends
don't support it, from Holger Hoffstätte.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
selftest/bpf: Make crashes more debuggable in test_progs
libbpf: Add documentation to map pinning API functions
libbpf: Fix malformed documentation formatting
selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
bpf/selftests: Verify struct_ops prog sleepable behavior
bpf: Pass const struct bpf_prog * to .check_member
libbpf: Support sleepable struct_ops.s section
bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
selftests/bpf: Fix vmtest static compilation error
tools/resolve_btfids: Alter how HOSTCC is forced
tools/resolve_btfids: Install subcmd headers
bpf/docs: Document the nocast aliasing behavior of ___init
bpf/docs: Document how nested trusted fields may be defined
bpf/docs: Document cpumask kfuncs in a new file
selftests/bpf: Add selftest suite for cpumask kfuncs
selftests/bpf: Add nested trust selftests suite
bpf: Enable cpumasks to be queried and used as kptrs
bpf: Disallow NULLable pointers for trusted kfuncs
...
====================
Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RCwAAKCRDbK58LschI
g7drAQDfMPc1Q2CE4LZ9oh2wu1Nt2/85naTDK/WirlCToKs0xwD+NRQOyO3hcoJJ
rOCwfjOlAm+7uqtiwwodBvWgTlDgVAM=
=0fC+
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
bpf 2023-01-27
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 10 files changed, 170 insertions(+), 59 deletions(-).
The main changes are:
1) Fix preservation of register's parent/live fields when copying
range-info, from Eduard Zingerman.
2) Fix an off-by-one bug in bpf_mem_cache_idx() to select the right
cache, from Hou Tao.
3) Fix stack overflow from infinite recursion in sock_map_close(),
from Jakub Sitnicki.
4) Fix missing btf_put() in register_btf_id_dtor_kfuncs()'s error path,
from Jiri Olsa.
5) Fix a splat from bpf_setsockopt() via lsm_cgroup/socket_sock_rcv_skb,
from Kui-Feng Lee.
6) Fix bpf_send_signal[_thread]() helpers to hold a reference on the task,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix the kernel crash caused by bpf_setsockopt().
selftests/bpf: Cover listener cloning with progs attached to sockmap
selftests/bpf: Pass BPF skeleton to sockmap_listen ops tests
bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
bpf: Add missing btf_put to register_btf_id_dtor_kfuncs
selftests/bpf: Verify copy_register_state() preserves parent/live fields
bpf: Fix to preserve reg parent/live fields when copying range info
bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers
bpf: Fix off-by-one error in bpf_mem_cache_idx()
====================
Link: https://lore.kernel.org/r/20230127215820.4993-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch removes the msleep(4s) during netpoll_setup() if the carrier
appears instantly.
Here are some scenarios where this workaround is counter-productive in
modern ages:
Servers which have BMC communicating over NC-SI via the same NIC as gets
used for netconsole. BMC will keep the PHY up, hence the carrier
appearing instantly.
The link is fibre, SERDES getting sync could happen within 0.1Hz, and
the carrier also appears instantly.
Other than that, if a driver is reporting instant carrier and then
losing it, this is probably a driver bug.
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230125185230.3574681-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GSO should not merge page pool recycled frames with standard reference
counted frames. Traditionally this didn't occur, at least not often.
However as we start looking at adding support for wireless adapters there
becomes the potential to mix the two due to A-MSDU repartitioning frames in
the receive path. There are possibly other places where this may have
occurred however I suspect they must be few and far between as we have not
seen this issue until now.
Fixes: 53e0961da1 ("page_pool: add frag page recycling support in page pool")
Reported-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1d18bb1a4d ("devlink: allow registering parameters after
the instance") as the subject implies introduced possibility to register
devlink params even for already registered devlink instance. This is a
bit problematic, as the consistency or params list was originally
secured by the fact it is static during devlink lifetime. So in order to
protect the params list, take devlink instance lock during the params
operations. Introduce unlocked function variants and use them in drivers
in locked context. Put lock assertions to appropriate places.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Put couple of WARN_ONs in devlink_param_driverinit_value_get() function
to clearly indicate, that it is a driver bug if used without reload
support or for non-driverinit param.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink_param_driverinit_value_set() currently returns int with possible
error, but no user is checking it anyway. The only reason for a fail is
a driver bug. So convert the function to return void and put WARN_ONs
on error paths.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a WARN_ON checking the param_item for being NULL when the param
is not inserted in the list. That indicates a driver BUG. Instead of
continuing to work with NULL pointer with its consequences, return.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no user outside the devlink code, so remove the export and make
the functions static. Move them before callers to avoid forward
declarations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all SET commands where new common code is applicable.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most ethtool SET callbacks follow the same general structure.
ethnl_parse_header_dev_get()
rtnl_lock()
ethnl_ops_begin()
... do stuff ...
ethtool_notify()
ethnl_ops_complete()
rtnl_unlock()
ethnl_parse_header_dev_put()
This leads to a lot of copy / pasted code an bugs when people
mis-handle the error path.
Add a generic implementation of this pattern with a .set callback
in struct ethnl_request_ops called to "do stuff".
Also add an optional .set_validate which is called before
ethnl_ops_begin() -- a lot of implementations do basic request
capability / sanity checking at that point.
Because we want to avoid generating the notification when
no change happened - adopt a slightly hairy return values:
- 0 means nothing to do (no notification)
- 1 means done / continue
- negative error codes on error
Reuse .hdr_attr from struct ethnl_request_ops, GET and SET
use the same attr spaces in all cases.
Convert pause as an example (and to avoid unused function warnings).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Number of files depend on linux/splice.h getting included
by linux/skbuff.h which soon will no longer be the case.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Number of files depend on linux/sched/clock.h getting included
by linux/skbuff.h which soon will no longer be the case.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This include was added for skb_find_text() but all we need there
is a forward declaration of struct ts_config.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some reason, blamed commit did the right thing in xfrm_policy_timer()
but did not in xfrm_timer_handler()
Fixes: 386c5680e2 ("xfrm: use time64_t for in-kernel timestamps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.
Let's use extack to return this information to users.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.
Let's use extack to return this information to users.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Like in all other functions in this file, a single point of exit is used
when extra operations are needed: unlock, decrement refcount, etc.
There is no functional change for the moment but it is better to do the
same here to make sure all cleanups are done in case of intermediate
errors.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Usually, attributes are propagated to subflows as well.
Here, if subflows are created by other ways than the MPTCP path-manager,
it is important to make sure they are in v6 if it is asked by the
userspace.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.
This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.
A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are multiple ICMP rate limiting mechanisms:
* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask
However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.
Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.
Example output:
# nstat -sz "*RateLimit*"
IcmpOutRateLimitGlobal 134 0.0
IcmpOutRateLimitHost 770 0.0
Icmp6OutRateLimitHost 84 0.0
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Resolve an issue when calling sol_tcp_sockopt() on a socket with ktls
enabled. Prior to this patch, sol_tcp_sockopt() would only allow calls
if the function pointer of setsockopt of the socket was set to
tcp_setsockopt(). However, any socket with ktls enabled would have its
function pointer set to tls_setsockopt(). To resolve this issue, the
patch adds a check of the protocol of the linux socket and allows
bpf_setsockopt() to be called if ktls is initialized on the linux
socket. This ensures that calls to sol_tcp_sockopt() will succeed on
sockets with ktls enabled.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230125201608.908230-2-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In a set of prior changes, we added the ability for struct_ops programs
to be sleepable. This patch enhances the dummy_st_ops selftest suite to
validate this behavior by adding a new sleepable struct_ops entry to
dummy_st_ops.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-5-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The .check_member field of struct bpf_struct_ops is currently passed the
member's btf_type via const struct btf_type *t, and a const struct
btf_member *member. This allows the struct_ops implementation to check
whether e.g. an ops is supported, but it would be useful to also enforce
that the struct_ops prog being loaded for that member has other
qualities, like being sleepable (or not). This patch therefore updates
the .check_member() callback to also take a const struct bpf_prog *prog
argument.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Once a socket has been unhashed, we want to prevent it from being
re-used in a sk_key entry as part of a routing operation.
This change marks the sk as SOCK_DEAD on unhash, which prevents addition
into the net's key list.
We need to do this during the key add path, rather than key lookup, as
we release the net keys_lock between those operations.
Fixes: 4a992bbd36 ("mctp: Implement message fragmentation & reassembly")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we have a race where we look up a sock through a "general"
(ie, not directly associated with the (src,dest,tag) tuple) key, then
drop the key reference while still holding the key's sock.
This change expands the key reference until we've finished using the
sock, and hence the sock reference too.
Commit message changes from Jeremy Kerr <jk@codeconstruct.com.au>.
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Fixes: 73c618456d ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we delete the key expiry timer (in sk->close) before
unhashing the sk. This means that another thread may find the sk through
its presence on the key list, and re-queue the timer.
This change moves the timer deletion to the unhash, after we have made
the key no longer observable, so the timer cannot be re-queued.
Fixes: 7b14e15ae6 ("mctp: Implement a timeout for tags")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we correlate the mctp_sk_key lifetime to the sock lifetime
through the sock hash/unhash operations, but this is pretty tenuous, and
there are cases where we may have a temporary reference to an unhashed
sk.
This change makes the reference more explicit, by adding a hold on the
sock when it's associated with a mctp_sk_key, released on final key
unref.
Fixes: 73c618456d ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the following call path:
ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> pause_prepare_data
struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.
To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.
The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.
Fixes: 04692c9020 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
In the following call path:
ethnl_default_dumpit
-> ethnl_default_dump_one
-> ctx->ops->prepare_data
-> stats_prepare_data
struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.
To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.
The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.
Fixes: 04692c9020 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When listen() and accept() are called on an x25 socket
that connect() succeeds, accept() succeeds immediately.
This is because x25_connect() queues the skb to
sk->sk_receive_queue, and x25_accept() dequeues it.
This creates a child socket with the sk of the parent
x25 socket, which can cause confusion.
Fix x25_listen() to return -EINVAL if the socket has
already been successfully connect()ed to avoid this issue.
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing an ISM device prior to terminating its associated connections
doesn't end well.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A listening socket linked to a sockmap has its sk_prot overridden. It
points to one of the struct proto variants in tcp_bpf_prots. The variant
depends on the socket's family and which sockmap programs are attached.
A child socket cloned from a TCP listener initially inherits their sk_prot.
But before cloning is finished, we restore the child's proto to the
listener's original non-tcp_bpf_prots one. This happens in
tcp_create_openreq_child -> tcp_bpf_clone.
Today, in tcp_bpf_clone we detect if the child's proto should be restored
by checking only for the TCP_BPF_BASE proto variant. This is not
correct. The sk_prot of listening socket linked to a sockmap can point to
to any variant in tcp_bpf_prots.
If the listeners sk_prot happens to be not the TCP_BPF_BASE variant, then
the child socket unintentionally is left if the inherited sk_prot by
tcp_bpf_clone.
This leads to issues like infinite recursion on close [1], because the
child state is otherwise not set up for use with tcp_bpf_prot operations.
Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants.
Note that it wouldn't be sufficient to check the socket state when
overriding the sk_prot in tcp_bpf_update_proto in order to always use the
TCP_BPF_BASE variant for listening sockets. Since commit
b8b8315e39 ("bpf, sockmap: Remove unhash handler for BPF sockmap usage")
it is possible for a socket to transition to TCP_LISTEN state while already
linked to a sockmap, e.g. connect() -> insert into map ->
connect(AF_UNSPEC) -> listen().
[1]: https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/
Fixes: e80251555f ("tcp_bpf: Don't let child socket inherit parent protocol ops on copy")
Reported-by: syzbot+04c21ed96d861dccc5cd@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-2-1e0ee7ac2f90@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Build bot detects that err may be returned uninitialized in
devlink_fmsg_prepare_skb(). This is not really true because
all fmsgs users should create at least one outer nest, and
therefore fmsg can't be completely empty.
That said the assumption is not trivial to confirm, so let's
follow the bots advice, anyway.
This code does not seem to have changed since its inception in
commit 1db64e8733 ("devlink: Add devlink formatted message (fmsg) API")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Perform SCTP vtag verification for ABORT/SHUTDOWN_COMPLETE according
to RFC 9260, Sect 8.5.1.
2) Fix infinite loop if SCTP chunk size is zero in for_each_sctp_chunk().
And remove useless check in this macro too.
3) Revert DATA_SENT state in the SCTP tracker, this was applied in the
previous merge window. Next patch in this series provides a more
simple approach to multihoming support.
4) Unify HEARTBEAT_ACKED and ESTABLISHED states for SCTP multihoming
support, use default ESTABLISHED of 210 seconds based on
heartbeat timeout * maximum number of retransmission + round-trip timeout.
Otherwise, SCTP conntrack entry that represents secondary paths
remain stale in the table for up to 5 days.
This is a slightly large batch with fixes for the SCTP connection
tracking helper, all patches from Sriram Yagnaraman.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: unify established states for SCTP paths
Revert "netfilter: conntrack: add sctp DATA_SENT state"
netfilter: conntrack: fix bug in for_each_sctp_chunk
netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE
====================
Link: https://lore.kernel.org/r/20230124183933.4752-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, if you bind the socket to something like:
servaddr.sin6_family = AF_INET6;
servaddr.sin6_port = htons(0);
servaddr.sin6_scope_id = 0;
inet_pton(AF_INET6, "::1", &servaddr.sin6_addr);
And then request a connect to:
connaddr.sin6_family = AF_INET6;
connaddr.sin6_port = htons(20000);
connaddr.sin6_scope_id = if_nametoindex("lo");
inet_pton(AF_INET6, "fe88::1", &connaddr.sin6_addr);
What the stack does is:
- bind the socket
- create a new asoc
- to handle the connect
- copy the addresses that can be used for the given scope
- try to connect
But the copy returns 0 addresses, and the effect is that it ends up
trying to connect as if the socket wasn't bound, which is not the
desired behavior. This unexpected behavior also allows KASLR leaks
through SCTP diag interface.
The fix here then is, if when trying to copy the addresses that can
be used for the scope used in connect() it returns 0 addresses, bail
out. This is what TCP does with a similar reproducer.
Reported-by: Pietro Borrello <borrello@diag.uniroma1.it>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/9fcd182f1099f86c6661f3717f63712ddd1c676c.1674496737.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 2c7bc10d0f ("netlink: add macro for checking dump ctx size")
misspelled the name of the assert as asset, missing an R.
Reported-by: Ido Schimmel <idosch@idosch.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Generate and plug in the spec-based tables.
A little bit of renaming is needed in the FOU code.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We'll need to link two objects together to form the fou module.
This means the source can't be called fou, the build system expects
fou.o to be the combined object.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
William reports kernel soft-lockups on some OVS topologies when TC mirred
egress->ingress action is hit by local TCP traffic [1].
The same can also be reproduced with SCTP (thanks Xin for verifying), when
client and server reach themselves through mirred egress to ingress, and
one of the two peers sends a "heartbeat" packet (from within a timer).
Enqueueing to backlog proved to fix this soft lockup; however, as Cong
noticed [2], we should preserve - when possible - the current mirred
behavior that counts as "overlimits" any eventual packet drop subsequent to
the mirred forwarding action [3]. A compromise solution might use the
backlog only when tcf_mirred_act() has a nest level greater than one:
change tcf_mirred_forward() accordingly.
Also, add a kselftest that can reproduce the lockup and verifies TC mirred
ability to account for further packet drops after TC mirred egress->ingress
(when the nest level is 1).
[1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/
[3] such behavior is not guaranteed: for example, if RPS or skb RX
timestamping is enabled on the mirred target device, the kernel
can defer receiving the skb and return NET_RX_SUCCESS inside
tcf_mirred_forward().
Reported-by: William Zhao <wizhao@redhat.com>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
with commit e2ca070f89 ("net: sched: protect against stack overflow in
TC act_mirred"), act_mirred protected itself against excessive stack growth
using per_cpu counter of nested calls to tcf_mirred_act(), and capping it
to MIRRED_RECURSION_LIMIT. However, such protection does not detect
recursion/loops in case the packet is enqueued to the backlog (for example,
when the mirred target device has RPS or skb timestamping enabled). Change
the wording from "recursion" to "nesting" to make it more clear to readers.
CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
An SCTP endpoint can start an association through a path and tear it
down over another one. That means the initial path will not see the
shutdown sequence, and the conntrack entry will remain in ESTABLISHED
state for 5 days.
By merging the HEARTBEAT_ACKED and ESTABLISHED states into one
ESTABLISHED state, there remains no difference between a primary or
secondary path. The timeout for the merged ESTABLISHED state is set to
210 seconds (hb_interval * max_path_retrans + rto_max). So, even if a
path doesn't see the shutdown sequence, it will expire in a reasonable
amount of time.
With this change in place, there is now more than one state from which
we can transition to ESTABLISHED, COOKIE_ECHOED and HEARTBEAT_SENT, so
handle the setting of ASSURED bit whenever a state change has happened
and the new state is ESTABLISHED. Removed the check for dir==REPLY since
the transition to ESTABLISHED can happen only in the reply direction.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit (bff3d05348: "netfilter: conntrack: add sctp
DATA_SENT state")
Using DATA/SACK to detect a new connection on secondary/alternate paths
works only on new connections, while a HEARTBEAT is required on
connection re-use. It is probably consistent to wait for HEARTBEAT to
create a secondary connection in conntrack.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb_header_pointer() will return NULL if offset + sizeof(_sch) exceeds
skb->len, so this offset < skb->len test is redundant.
if sch->length == 0, this will end up in an infinite loop, add a check
for sch->length > 0
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RFC 9260, Sec 8.5.1 states that for ABORT/SHUTDOWN_COMPLETE, the chunk
MUST be accepted if the vtag of the packet matches its own tag and the
T bit is not set OR if it is set to its peer's vtag and the T bit is set
in chunk flags. Otherwise the packet MUST be silently dropped.
Update vtag verification for ABORT/SHUTDOWN_COMPLETE based on the above
description.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
LAN937x family of switches has 8 queues per port where the KSZ switches
has 4 queues per port. By default, only one queue per port is enabled.
The queues are configurable in 2, 4 or 8. This patch add 8 number of
queues for LAN937x and 4 for other switches.
In the tag_ksz.c file, prioirty of the packet is queried using the skb
buffer and the corresponding value is updated in the tag.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The spin_lock irqsave/restore API variant in skb_defer_free_flush can
be replaced with the faster spin_lock irq variant, which doesn't need
to read and restore the CPU flags.
Using the unconditional irq "disable/enable" API variant is safe,
because the skb_defer_free_flush() function is only called during
NAPI-RX processing in net_rx_action(), where it is known the IRQs
are enabled.
Expected gain is 14 cycles from avoiding reading and restoring CPU
flags in a spin_lock_irqsave/restore operation, measured via a
microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz.
Microbenchmark overhead of spin_lock+unlock:
- spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns
- spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns
We don't expect to see a measurable packet performance gain, as
skb_defer_free_flush() is called infrequently once per NIC device NAPI
bulk cycle and conditionally only if SKBs have been deferred by other
CPUs via skb_attempt_defer_free().
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix overlap detection in rbtree set backend: Detect overlap by going
through the ordered list of valid tree nodes. To shorten the number of
visited nodes in the list, this algorithm descends the tree to search
for an existing element greater than the key value to insert that is
greater than the new element.
2) Fix for the rbtree set garbage collector: Skip inactive and busy
elements when checking for expired elements to avoid interference
with an ongoing transaction from control plane.
This is a rather large fix coming at this stage of the 6.2-rc. Since
33c7aba0b4 ("netfilter: nf_tables: do not set up extensions for end
interval"), bogus overlap errors in the rbtree set occur more frequently.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_rbtree: skip elements in transaction from garbage collection
netfilter: nft_set_rbtree: Switch to node list walk for overlap detection
====================
Link: https://lore.kernel.org/r/20230123211601.292930-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A bug was introduced by commit eedade12f4 ("net: kfree_skb_list use
kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via
invoking skb_mark_not_on_list().
In this patch we choose to remove the skb_mark_not_on_list() call as it
isn't necessary. It would be possible and correct to call
skb_mark_not_on_list() only when __kfree_skb_reason() returns true,
meaning the SKB is ready to be free'ed, as it calls/check skb_unref().
This fix is needed as kfree_skb_list() is also invoked on skb_shared_info
frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can
have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(),
which takes a reference on all SKBs in the list. This implies the
invariant that all SKBs in the list must have the same refcnt, when using
kfree_skb_list().
Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Fixes: eedade12f4 ("net: kfree_skb_list use kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
if (!type)
continue;
if (type > RTAX_MAX)
return false;
...
fi_val = fi->fib_metrics->metrics[type - 1];
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes: 5f9ae3d9e7 ("ipv4: do metrics match when looking up and deleting a route")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120133140.3624204-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
if (!type)
continue;
if (type > RTAX_MAX)
return -EINVAL;
...
metrics[type - 1] = val;
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes: 6cf9dfd3bd ("net: fib: move metrics parsing to a helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120133040.3623463-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_getsockbyportid() reads sk_state while a concurrent
netlink_connect() can change its value.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_getname(), netlink_sendmsg() and netlink_getsockbyportid()
can read nlk->dst_portid and nlk->dst_group while another
thread is changing them.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First set of patches for v6.3. The most important change here is that
the old Wireless Extension user space interface is not supported on
Wi-Fi 7 devices at all. We also added a warning if anyone with modern
drivers (ie. cfg80211 and mac80211 drivers) tries to use Wireless
Extensions, everyone should switch to using nl80211 interface instead.
Static WEP support is removed, there wasn't any driver using that
anyway so there's no user impact. Otherwise it's smaller features and
fixes as usual.
Note: As mt76 had tricky conflicts due to the fixes in wireless tree,
we decided to merge wireless into wireless-next to solve them easily.
There should not be any merge problems anymore.
Major changes:
cfg80211
* remove never used static WEP support
* warn if Wireless Extention interface is used with cfg80211/mac80211 drivers
* stop supporting Wireless Extensions with Wi-Fi 7 devices
* support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting
rfkill
* add GPIO DT support
bitfield
* add FIELD_PREP_CONST()
mt76
* per-PHY LED support
rtw89
* support new Bluetooth co-existance version
rtl8xxxu
* support RTL8188EU
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmPOYeQRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZvSlAf/Y5ZY5xLEytUma7fBkBObXEfP/7tlBBsu
RoRKVx77D1LGfGu0WXG9PCdvyY70e2QtrkdeLHF3gfzLYpNZIyB/eOFhwzCtbJrD
ls2yXhdTm9OwDOHAdvXLXx3fmF4bXni7dYdi78VrGCFOnU6XE6X5JpnZYU1SmQ1U
8Ro7H6D9yp8MKfh5Ct19PYSTS5hmHB09vfJ4rbkjHp7kEGvJjYNbvAqGsxatPnh9
Zw35TEIwmhZO4GsXxsG12g6LZa8W8RO8uCwepHxtFM8oGsF68Yb/lkLcdtMiuN6V
WdB6qn24faEWjdmt5BzJGueA3Td8KI6t5cHhGbQVKjyFD8lAC+IJQA==
=Nq9U
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2023-01-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.3
First set of patches for v6.3. The most important change here is that
the old Wireless Extension user space interface is not supported on
Wi-Fi 7 devices at all. We also added a warning if anyone with modern
drivers (ie. cfg80211 and mac80211 drivers) tries to use Wireless
Extensions, everyone should switch to using nl80211 interface instead.
Static WEP support is removed, there wasn't any driver using that
anyway so there's no user impact. Otherwise it's smaller features and
fixes as usual.
Note: As mt76 had tricky conflicts due to the fixes in wireless tree,
we decided to merge wireless into wireless-next to solve them easily.
There should not be any merge problems anymore.
Major changes:
cfg80211
- remove never used static WEP support
- warn if Wireless Extention interface is used with cfg80211/mac80211 drivers
- stop supporting Wireless Extensions with Wi-Fi 7 devices
- support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting
rfkill
- add GPIO DT support
bitfield
- add FIELD_PREP_CONST()
mt76
- per-PHY LED support
rtw89
- support new Bluetooth co-existance version
rtl8xxxu
- support RTL8188EU
* tag 'wireless-next-2023-01-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (123 commits)
wifi: wireless: deny wireless extensions on MLO-capable devices
wifi: wireless: warn on most wireless extension usage
wifi: mac80211: drop extra 'e' from ieeee80211... name
wifi: cfg80211: Deduplicate certificate loading
bitfield: add FIELD_PREP_CONST()
wifi: mac80211: add kernel-doc for EHT structure
mac80211: support minimal EHT rate reporting on RX
wifi: mac80211: Add HE MU-MIMO related flags in ieee80211_bss_conf
wifi: mac80211: Add VHT MU-MIMO related flags in ieee80211_bss_conf
wifi: cfg80211: Use MLD address to indicate MLD STA disconnection
wifi: cfg80211: Support 32 bytes KCK key in GTK rekey offload
wifi: cfg80211: Fix extended KCK key length check in nl80211_set_rekey_data()
wifi: cfg80211: remove support for static WEP
wifi: rtl8xxxu: Dump the efuse only for untested devices
wifi: rtl8xxxu: Print the ROM version too
wifi: rtw88: Use non-atomic sta iterator in rtw_ra_mask_info_update()
wifi: rtw88: Use rtw_iterate_vifs() for rtw_vif_watch_dog_iter()
wifi: rtw88: Move register access from rtw_bf_assoc() outside the RCU
wifi: rtl8xxxu: Use a longer retry limit of 48
wifi: rtl8xxxu: Report the RSSI to the firmware
...
====================
Link: https://lore.kernel.org/r/20230123103338.330CBC433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Skip interference with an ongoing transaction, do not perform garbage
collection on inactive elements. Reset annotated previous end interval
if the expired element is marked as busy (control plane removed the
element right before expiration).
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
...instead of a tree descent, which became overly complicated in an
attempt to cover cases where expired or inactive elements would affect
comparisons with the new element being inserted.
Further, it turned out that it's probably impossible to cover all those
cases, as inactive nodes might entirely hide subtrees consisting of a
complete interval plus a node that makes the current insertion not
overlap.
To speed up the overlap check, descent the tree to find a greater
element that is closer to the key value to insert. Then walk down the
node list for overlap detection. Starting the overlap check from
rb_first() unconditionally is slow, it takes 10 times longer due to the
full linear traversal of the list.
Moreover, perform garbage collection of expired elements when walking
down the node list to avoid bogus overlap reports.
For the insertion operation itself, this essentially reverts back to the
implementation before commit 7c84d41416 ("netfilter: nft_set_rbtree:
Detect partial overlaps on insertion"), except that cases of complete
overlap are already handled in the overlap detection phase itself, which
slightly simplifies the loop to find the insertion point.
Based on initial patch from Stefano Brivio, including text from the
original patch description too.
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not
supported by the target device, the default implementation is called instead.
The verifier, at load time, replaces a call to the generic kfunc with a call
to the per-device one. Per-device kfunc pointers are stored in separate
struct xdp_metadata_ops.
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.
netdevsim checks are dropped in favor of generic check in dev_xdp_attach.
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The DSA core is in charge of the ethtool_ops of the net devices
associated with switch ports, so in case a hardware driver supports the
MAC merge layer, DSA must pass the callbacks through to the driver.
Add support for precisely that.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a pMAC exists but the driver is unable to atomically query the
aggregate eMAC+pMAC statistics, the user should be given back at least
the sum of eMAC and pMAC counters queried separately.
This is a generic problem, so add helpers in ethtool to do this
operation, if the driver doesn't have a better way to report aggregate
stats. Do this in a way that does not require changes to these functions
when new stats are added (basically treat the structures as an array of
u64 values, except for the first element which is the stats source).
In include/linux/ethtool.h, there is already a section where helper
function prototypes should be placed. The trouble is, this section is
too early, before the definitions of struct ethtool_eth_mac_stats et.al.
Move that section at the end and append these new helpers to it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.3-2018 clause 99 defines a MAC Merge sublayer which contains an
Express MAC and a Preemptible MAC. Both MACs are hidden to higher and
lower layers and visible as a single MAC (packet classification to eMAC
or pMAC on TX is done based on priority; classification on RX is done
based on SFD).
For devices which support a MAC Merge sublayer, it is desirable to
retrieve individual packet counters from the eMAC and the pMAC, as well
as aggregate statistics (their sum).
Introduce a new ETHTOOL_A_STATS_SRC attribute which is part of the
policy of ETHTOOL_MSG_STATS_GET and, and an ETHTOOL_A_PAUSE_STATS_SRC
which is part of the policy of ETHTOOL_MSG_PAUSE_GET (accepted when
ETHTOOL_FLAG_STATS is set in the common ethtool header). Both of these
take values from enum ethtool_mac_stats_src, defaulting to "aggregate"
in the absence of the attribute.
Existing drivers do not need to pay attention to this enum which was
added to all driver-facing structures, just the ones which report the
MAC merge layer as supported.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MAC merge sublayer (IEEE 802.3-2018 clause 99) is one of 2
specifications (the other being Frame Preemption; IEEE 802.1Q-2018
clause 6.7.2), which work together to minimize latency caused by frame
interference at TX. The overall goal of TSN is for normal traffic and
traffic with a bounded deadline to be able to cohabitate on the same L2
network and not bother each other too much.
The standards achieve this (partly) by introducing the concept of
preemptible traffic, i.e. Ethernet frames that have a custom value for
the Start-of-Frame-Delimiter (SFD), and these frames can be fragmented
and reassembled at L2 on a link-local basis. The non-preemptible frames
are called express traffic, they are transmitted using a normal SFD, and
they can preempt preemptible frames, therefore having lower latency,
which can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo
frames around 9K). Preemption is not recursive, i.e. a P frame cannot
preempt another P frame. Preemption also does not depend upon priority,
or otherwise said, an E frame with prio 0 will still preempt a P frame
with prio 7.
In terms of implementation, the standards talk about the presence of an
express MAC (eMAC) which handles express traffic, and a preemptible MAC
(pMAC) which handles preemptible traffic, and these MACs are multiplexed
on the same MII by a MAC merge layer.
To support frame preemption, the definition of the SFD was generalized
to SMD (Start-of-mPacket-Delimiter), where an mPacket is essentially an
Ethernet frame fragment, or a complete frame. Stations unaware of an SMD
value different from the standard SFD will treat P frames as error
frames. To prevent that from happening, a negotiation process is
defined.
On RX, packets are dispatched to the eMAC or pMAC after being filtered
by their SMD. On TX, the eMAC/pMAC classification decision is taken by
the 802.1Q spec, based on packet priority (each of the 8 user priority
values may have an admin-status of preemptible or express).
The MAC Merge layer and the Frame Preemption parameters have some degree
of independence in terms of how software stacks are supposed to deal
with them. The activation of the MM layer is supposed to be controlled
by an LLDP daemon (after it has been communicated that the link partner
also supports it), after which a (hardware-based or not) verification
handshake takes place, before actually enabling the feature. So the
process is intended to be relatively plug-and-play. Whereas FP settings
are supposed to be coordinated across a network using something
approximating NETCONF.
The support contained here is exclusively for the 802.3 (MAC Merge)
portions and not for the 802.1Q (Frame Preemption) parts. This API is
sufficient for an LLDP daemon to do its job. The FP adminStatus variable
from 802.1Q is outside the scope of an LLDP daemon.
I have taken a few creative licenses and augmented the Linux kernel UAPI
compared to the standard managed objects recommended by IEEE 802.3.
These are:
- ETHTOOL_A_MM_PMAC_ENABLED: According to Figure 99-6: Receive
Processing state diagram, a MAC Merge layer is always supposed to be
able to receive P frames. However, this implies keeping the pMAC
powered on, which will consume needless power in applications where FP
will never be used. If LLDP is used, the reception of an Additional
Ethernet Capabilities TLV from the link partner is sufficient
indication that the pMAC should be enabled. So my proposal is that in
Linux, we keep the pMAC turned off by default and that user space
turns it on when needed.
- ETHTOOL_A_MM_VERIFY_ENABLED: The IEEE managed object is called
aMACMergeVerifyDisableTx. I opted for consistency (positive logic) in
the boolean netlink attributes offered, so this is also positive here.
Other than the meaning being reversed, they correspond to the same
thing.
- ETHTOOL_A_MM_MAX_VERIFY_TIME: I found it most reasonable for a LLDP
daemon to maximize the verifyTime variable (delay between SMD-V
transmissions), to maximize its chances that the LP replies. IEEE says
that the verifyTime can range between 1 and 128 ms, but the NXP ENETC
stupidly keeps this variable in a 7 bit register, so the maximum
supported value is 127 ms. I could have chosen to hardcode this in the
LLDP daemon to a lower value, but why not let the kernel expose its
supported range directly.
- ETHTOOL_A_MM_TX_MIN_FRAG_SIZE: the standard managed object is called
aMACMergeAddFragSize, and expresses the "additional" fragment size
(on top of ETH_ZLEN), whereas this expresses the absolute value of the
fragment size.
- ETHTOOL_A_MM_RX_MIN_FRAG_SIZE: there doesn't appear to exist a managed
object mandated by the standard, but user space clearly needs to know
what is the minimum supported fragment size of our local receiver,
since LLDP must advertise a value no lower than that.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations. For example:
<...>
iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When proxying IPv6 NDP requests, the adverts to the initial multicast
solicits are correct and working. On the other hand, when later a
reachability confirmation is requested (on unicast), no reply is sent.
This causes the neighbor entry expiring on the sending node, which is
mostly a non-issue, as a new multicast request is sent. There are
routers, where the multicast requests are intentionally delayed, and in
these environments the current implementation causes periodic packet
loss for the proxied endpoints.
The root cause is the erroneous decrease of the hop limit, as this
is checked in ndisc.c and no answer is generated when it's 254 instead
of the correct 255.
Cc: stable@vger.kernel.org
Fixes: 46c7655f0b ("ipv6: decrease hop limit counter in ip6_forward()")
Signed-off-by: Gergely Risko <gergely.risko@gmail.com>
Tested-by: Gergely Risko <gergely.risko@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the fact that the kernel-side data structures have been carried
over from the ioctl-based ethtool, we are now in the situation where we
have an ethnl_update_bool32() function, but the plain function that
operates on a boolean value kept in an actual u8 netlink attribute
doesn't exist.
With new ethtool features that are exposed solely over netlink, the
kernel data structures will use the "bool" type, so we will need this
kind of helper. Introduce it now; it's needed for things like
verify-disabled for the MAC merge configuration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
int type = nla_type(nla);
if (type > XFRMA_MAX) {
return -EOPNOTSUPP;
}
@type is then used as an array index and can be used
as a Spectre v1 gadget.
if (nla_len(nla) < compat_policy[type].len) {
array_index_nospec() can be used to prevent leaking
content of kernel memory to malicious users.
Fixes: 5106f4a8ac ("xfrm/compat: Add 32=>64-bit messages translator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Prepare TVLV infrastructure for more packet types, in particular the
upcoming batman-adv multicast packet type.
For that swap the OGM vs. unicast-tvlv packet boolean indicator to an
explicit unsigned integer packet type variable. And provide the skb
to a call to batadv_tvlv_containers_process(), as later the multicast
packet's TVLV handler will need to have access not only to the TVLV but
the full skb for forwarding. Forwarding will be invoked from the
multicast packet's TVLVs' contents later.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The multicast code to send a multicast packet via multiple batman-adv
unicast packets is not only capable of sending to multiple but also to a
single node. Therefore we can safely remove the old, specialized, now
redundant multicast-to-single-unicast code.
The only functional change of this simplification is that the edge case
of allowing a multicast packet with an unsnoopable destination address
(224.0.0.0/24 or ff02::1) where only a single node has signaled interest
in it via the batman-adv want-all-unsnoopables multicast flag is now
transmitted via a batman-adv broadcast instead of a batman-adv unicast
packet. Maintaining this edge case feature does not seem worth the extra
lines of code and people should just not expect to be able to snoop and
optimize such unsnoopable multicast addresses when bridges are involved.
While at it also renaming a few items in the batadv_forw_mode enum to
prepare for the new batman-adv multicast packet type.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
If net_assign_generic() fails, the current error path in ops_init() tries
to clear the gen pointer slot. Anyway, in such error path, the gen pointer
itself has not been modified yet, and the existing and accessed one is
smaller than the accessed index, causing an out-of-bounds error:
BUG: KASAN: slab-out-of-bounds in ops_init+0x2de/0x320
Write of size 8 at addr ffff888109124978 by task modprobe/1018
CPU: 2 PID: 1018 Comm: modprobe Not tainted 6.2.0-rc2.mptcp_ae5ac65fbed5+ #1641
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x9f
print_address_description.constprop.0+0x86/0x2b5
print_report+0x11b/0x1fb
kasan_report+0x87/0xc0
ops_init+0x2de/0x320
register_pernet_operations+0x2e4/0x750
register_pernet_subsys+0x24/0x40
tcf_register_action+0x9f/0x560
do_one_initcall+0xf9/0x570
do_init_module+0x190/0x650
load_module+0x1fa5/0x23c0
__do_sys_finit_module+0x10d/0x1b0
do_syscall_64+0x58/0x80
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f42518f778d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff96869688 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 00005568ef7f7c90 RCX: 00007f42518f778d
RDX: 0000000000000000 RSI: 00005568ef41d796 RDI: 0000000000000003
RBP: 00005568ef41d796 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
R13: 00005568ef7f7d30 R14: 0000000000040000 R15: 0000000000000000
</TASK>
This change addresses the issue by skipping the gen pointer
de-reference in the mentioned error-path.
Found by code inspection and verified with explicit error injection
on a kasan-enabled kernel.
Fixes: d266935ac4 ("net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/cec4e0f3bb2c77ac03a6154a8508d3930beb5f0f.1674154348.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial default value of 0 for tp->rate_app_limited was incorrect,
since a flow is indeed application-limited until it first sends
data. Fixing the default to be 1 is generally correct but also
specifically will help user-space applications avoid using the initial
tcpi_delivery_rate value of 0 that persists until the connection has
some non-zero bandwidth sample.
Fixes: eb8329e0a0 ("tcp: export data delivery rate")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new helper functions to retrieve a mapping of priority to PCP
and DSCP bitmasks, where each bitmap contains ones in positions that
match a rewrite entry.
dcb_ieee_getrewr_prio_dscp_mask_map() reuses the dcb_ieee_app_prio_map,
as this struct is already used for a similar mapping in the app table.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new rewrite table and all the required functions, offload hooks and
bookkeeping for maintaining it. The rewrite table reuses the app struct,
and the entire set of app selectors. As such, some bookeeping code can
be shared between the rewrite- and the APP table.
New functions for getting, setting and deleting entries has been added.
Apart from operating on the rewrite list, these functions do not emit a
DCB_APP_EVENT when the list os modified. The new dcb_getrewr does a
lookup based on selector and priority and returns the protocol, so that
mappings from priority to protocol, for a given selector and ifindex is
obtained.
Also, a new nested attribute has been added, that encapsulates one or
more app structs. This attribute is used to distinguish the two tables.
The dcb_lock used for the APP table is reused for the rewrite table.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for DCB rewrite. Add a new function for setting and
deleting both app and rewrite entries. Moving this into a separate
function reduces duplicate code, as both type of entries requires the
same set of checks. The function will now iterate through a configurable
nested attribute (app or rewrite attr), validate each attribute and call
the appropriate set- or delete function.
Note that this function always checks for nla_len(attr_itr) <
sizeof(struct dcb_app), which was only done in dcbnl_ieee_set and not in
dcbnl_ieee_del prior to this patch. This means, that any userspace tool
that used to shove in data < sizeof(struct dcb_app) would now receive
-ERANGE.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to DCB rewrite. Modify dcb_app_add to take new struct
list_head * as parameter, to make the used list configurable. This is
done to allow reusing the function for adding rewrite entries to the
rewrite table, which is introduced in a later patch.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After region and linecard lock removals, this helper is always supposed
to be called with instance lock held. So put the assertion here and
remove the comment which is no longer accurate.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
devlink_dump_for_each_instance_get() is currently called from
a single place in netlink.c. As there is no need to use
this helper anywhere else in the future, remove it and
call devlinks_xa_find_get() directly from while loop
in devlink_nl_instance_iter_dump(). Also remove redundant
idx clear on loop end as it is already done
in devlink_nl_instance_iter_dump().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Benefit from recently introduced instance iteration and convert
reporters .dumpit generic netlink callback to use it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Benefit from recently introduced instance iteration and convert
linecards .dumpit generic netlink callback to use it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As long as the reporter life time is protected by devlink instance
lock, the reference counting is no longer needed. Remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove port-specific health reporter destroy function as it is
currently the same as the instance one so no longer needed. Inline
__devlink_health_reporter_destroy() as it is no longer called from
multiple places.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to other devlink objects, rely on devlink instance lock
and remove object specific reporters_lock.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to other devlink objects, protect the reporters list
by devlink instance lock. Alongside add unlocked versions
of health reporter create/destroy functions and use them in drivers
on call paths where the instance lock is held.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As long as the linecard life time is protected by devlink instance
lock, the reference counting is no longer needed. Remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to other devlink objects, convert the linecards list to be
protected by devlink instance lock. Alongside with that rename the
create/destroy() functions to devl_* to indicate the devlink instance
lock needs to be held while calling them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These are WiFi 7 devices that will be introduced into the market
in 2023, with new drivers. Wireless extensions haven't been in
real development since 2006. Since wireless has evolved a lot,
and continues to evolve significantly with Multi-Link Operation,
there's really no good way to still support wireless extensions
for devices that do MLO.
Stop supporting wireless extensions for new devices. We don't
consider this a regression since no such devices (apart from
hwsim) exist yet.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230118105152.45f85078a1e0.Ib9eabc2ec5bf6b0244e4d973e93baaa3d8c91bd8@changeid
With WiFi 7 (802.11ax, MLO/EHT) around the corner, we're going to
remove support for wireless extensions with new devices since MLO
(multi-link operation) cannot be properly indicated using them.
Add a warning to indicate which processes are still using wireless
extensions, if being used with modern (i.e. cfg80211) drivers.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230118105152.a7158a929a6f.Ifcf30eeeb8fc7019e4dcf2782b04515254d165e1@changeid
The referenced commit changed the error code returned by the kernel
when preventing a non-established socket from attaching the ktls
ULP. Before to such a commit, the user-space got ENOTCONN instead
of EINVAL.
The existing self-tests depend on such error code, and the change
caused a failure:
RUN global.non_established ...
tls.c:1673:non_established:Expected errno (22) == ENOTCONN (107)
non_established: Test failed at step #3
FAIL global.non_established
In the unlikely event existing applications do the same, address
the issue by restoring the prior error code in the above scenario.
Note that the only other ULP performing similar checks at init
time - smc_ulp_ops - also fails with ENOTCONN when trying to attach
the ULP to a non-established socket.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 2c02d41d71 ("net/ulp: prevent ULP without clone op from entering the LISTEN status")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/7bb199e7a93317fb6f8bf8b9b2dc71c18f337cde.1674042685.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Somehow an extra 'e' slipped in there without anyone noticing,
drop that from ieeee80211_obss_color_collision_notify().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
load_keys_from_buffer() in net/wireless/reg.c duplicates
x509_load_certificate_list() in crypto/asymmetric_keys/x509_loader.c
for no apparent reason.
Deduplicate it. No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/e7280be84acda02634bc7cb52c97656182b9c700.1673197326.git.lukas@wunner.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
While one cpu is working on looking up the right socket from ehash
table, another cpu is done deleting the request socket and is about
to add (or is adding) the big socket from the table. It means that
we could miss both of them, even though it has little chance.
Let me draw a call trace map of the server side.
CPU 0 CPU 1
----- -----
tcp_v4_rcv() syn_recv_sock()
inet_ehash_insert()
-> sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
-> __sk_nulls_add_node_rcu(sk, list)
Notice that the CPU 0 is receiving the data after the final ack
during 3-way shakehands and CPU 1 is still handling the final ack.
Why could this be a real problem?
This case is happening only when the final ack and the first data
receiving by different CPUs. Then the server receiving data with
ACK flag tries to search one proper established socket from ehash
table, but apparently it fails as my map shows above. After that,
the server fetches a listener socket and then sends a RST because
it finds a ACK flag in the skb (data), which obeys RST definition
in RFC 793.
Besides, Eric pointed out there's one more race condition where it
handles tw socket hashdance. Only by adding to the tail of the list
before deleting the old one can we avoid the race if the reader has
already begun the bucket traversal and it would possibly miss the head.
Many thanks to Eric for great help from beginning to end.
Fixes: 5e0724d027 ("tcp/dccp: fix hashdance race for passive sessions")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/
Link: https://lore.kernel.org/r/20230118015941.1313-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Third set of fixes for v6.2. This time most of them are for drivers,
only one revert for mac80211. For an important mt76 fix we had to
cherry pick two commits from wireless-next.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmPHoUcRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZtaCgf9G16rPowxI6AD+EliFbArdiwrV+mJHDyN
6eVniHDKgiFnvbyqvh+sGpbuYMwqqfERdxU3qi4+YhVGWyNNQYdJlntggKsTVRKK
gtE6h4zAo2DC6F+/zYt/FkQ6mCK6UQsaHDktGEqRP0vxH8Kdk85+yXEuwklI+L1L
w5ZTZ3HRxdtMhF9AmjCVOUrEEFXosanYTwSZ+1nlMEZ8vc5Wg5TH9wgue3Eg+9vx
vUjfRrrjOAlvGCcb9lVvPseH7n0m/U2JbugQkebuEUvzo4Fxcl2mR9pFXLGGbtAM
gseuNUfJftKVyYlTLc8brI7XpSSx6pV75h1EmvrHPkjiw1oSGeK8ig==
=13z+
-----END PGP SIGNATURE-----
Merge tag 'wireless-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.2
Third set of fixes for v6.2. This time most of them are for drivers,
only one revert for mac80211. For an important mt76 fix we had to
cherry pick two commits from wireless-next.
* tag 'wireless-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
Revert "wifi: mac80211: fix memory leak in ieee80211_if_add()"
wifi: mt76: dma: fix a regression in adding rx buffers
wifi: mt76: handle possible mt76_rx_token_consume failures
wifi: mt76: dma: do not increment queue head if mt76_dma_add_buf fails
wifi: rndis_wlan: Prevent buffer overflow in rndis_query_oid
wifi: brcmfmac: fix regression for Broadcom PCIe wifi devices
wifi: brcmfmac: avoid NULL-deref in survey dump for 2G only device
wifi: brcmfmac: avoid handling disabled channels for survey dump
====================
Link: https://lore.kernel.org/r/20230118073749.AF061C433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add minimal support for RX EHT rate reporting, not yet
adding (modifying) any radiotap headers, just statistics
for cfg80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Adding flags for SU Beamformer, SU Beamformee, MU Beamformer and Full
Bandwidth UL MU-MIMO for HE. This is utilized to pass MU-MIMO
configurations from user space to driver in AP mode.
Signed-off-by: Muna Sinada <quic_msinada@quicinc.com>
Link: https://lore.kernel.org/r/1665006886-23874-2-git-send-email-quic_msinada@quicinc.com
[fixed indentation, removed redundant !!]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Adding flags for SU Beamformer, SU Beamformee, MU Beamformer and
MU Beamformee for VHT. This is utilized to pass MU-MIMO
configurations from user space to driver in AP mode.
Signed-off-by: Muna Sinada <quic_msinada@quicinc.com>
Link: https://lore.kernel.org/r/1665006886-23874-1-git-send-email-quic_msinada@quicinc.com
[fixed indentation, removed redundant !!]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, maximum KCK key length supported for GTK rekey offload is 24
bytes but with some newer AKMs the KCK key length can be 32 bytes. e.g.,
00-0F-AC:24 AKM suite with SAE finite cyclic group 21. Add support to
allow 32 bytes KCK keys in GTK rekey offload.
Signed-off-by: Shivani Baranwal <quic_shivbara@quicinc.com>
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20221206143715.1802987-3-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The extended KCK key length check wrongly using the KEK key attribute
for validation. Due to this GTK rekey offload is failing when the KCK
key length is 24 bytes even though the driver advertising
WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK flag. Use correct attribute to fix the
same.
Fixes: 093a48d2aa ("cfg80211: support bigger kek/kck key length")
Signed-off-by: Shivani Baranwal <quic_shivbara@quicinc.com>
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20221206143715.1802987-2-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit b8676221f0 ("cfg80211: Add support for
static WEP in the driver") since no driver ever ended up using
it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pablo Niera Ayuso says:
====================
The following patchset contains Netfilter fixes for net:
1) Fix syn-retransmits until initiator gives up when connection is re-used
due to rst marked as invalid, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal says:
====================
Netfilter updates for net-next
following patch set includes netfilter updates for your *net-next* tree.
1. Replace pr_debug use with nf_log infra for debugging in sctp
conntrack.
2. Remove pr_debug calls, they are either useless or we have better
options in place.
3. Avoid repeated load of ct->status in some spots.
Some bit-flags cannot change during the lifeetime of
a connection, so no need to re-fetch those.
4. Avoid uneeded nesting of rcu_read_lock during tuple lookup.
5. Remove the CLUSTERIP target. Marked as obsolete for years,
and we still have WARN splats wrt. races of the out-of-band
/proc interface installed by this target.
6. Add static key to nf_tables to avoid the retpoline mitigation
if/else if cascade provided the cpu doesn't need the retpoline thunk.
7. add nf_tables objref calls to the retpoline mitigation workaround.
8. Split parts of nft_ct.c that do not need symbols exported by
the conntrack modules and place them in nf_tables directly.
This allows to avoid indirect call for 'ct status' checks.
9. Add 'destroy' commands to nf_tables. They are identical
to the existing 'delete' commands, but do not indicate
an error if the referenced object (set, chain, rule...)
did not exist, from Fernando.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Per cpu entries are no longer used in consideration
for doing gc or not. Remove the extra per cpu entries
pull to directly check for time and perform gc.
Signed-off-by: Tanmay Bhushan <007047221b@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce NFT_MSG_DESTROY* message type. The destroy operation performs a
delete operation but ignoring the ENOENT errors.
This is useful for the transaction semantics, where failing to delete an
object which does not exist results in aborting the transaction.
This new command allows the transaction to proceed in case the object
does not exist.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
nft_ct expression cannot be made builtin to nf_tables without also
forcing the conntrack itself to be builtin.
However, this can be avoided by splitting retrieval of a few
selector keys that only need to access the nf_conn structure,
i.e. no function calls to nf_conntrack code.
Many rulesets start with something like
"ct status established,related accept"
With this change, this no longer requires an indirect call, which
gives about 1.8% more throughput with a simple conntrack-enabled
forwarding test (retpoline thunk used).
Signed-off-by: Florian Westphal <fw@strlen.de>
If CONFIG_RETPOLINE is enabled nf_tables avoids indirect calls for
builtin expressions.
On newer cpus indirect calls do not go through the retpoline thunk
anymore, even for RETPOLINE=y builds.
Just like with the new tc retpoline wrappers:
Add a static key to skip the if / else if cascade if the cpu
does not require retpolines.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Marked as 'to be removed soon' since kernel 4.1 (2015).
Functionality was superseded by the 'cluster' match, added in kernel
2.6.30 (2009).
clusterip_tg_check still has races that can give
proc_dir_entry 'ipt_CLUSTERIP/10.1.1.2' already registered
followed by a WARN splat.
Remove it instead of trying to fix this up again.
clusterip uapi header is left as-is for now.
Signed-off-by: Florian Westphal <fw@strlen.de>
Move rcu_read_lock/unlock to nf_conntrack_find_get(), this avoids
nested rcu_read_lock call from resolve_normal_ct().
Signed-off-by: Florian Westphal <fw@strlen.de>
Compiler can't merge the two test_bit() calls, so load ct->status
once and use non-atomic accesses.
This is fine because IPS_EXPECTED or NAT_CLASH are either set at ct
creation time or not at all, but compiler can't know that.
Signed-off-by: Florian Westphal <fw@strlen.de>
Those are all useless or dubious.
getorigdst() is called via setsockopt, so return value/errno will
already indicate an appropriate error.
For other pr_debug calls there are better replacements, such as
slab/slub debugging or 'conntrack -E' (ctnetlink events).
Signed-off-by: Florian Westphal <fw@strlen.de>
The conntrack logging facilities include useful info such as in/out
interface names and packet headers.
Use those in more places instead of pr_debug calls.
Furthermore, several pr_debug calls can be removed, they are useless
on production machines due to the sheer volume of log messages.
Signed-off-by: Florian Westphal <fw@strlen.de>
syzbot reports a possible deadlock in rfcomm_sk_state_change [1].
While rfcomm_sock_connect acquires the sk lock and waits for
the rfcomm lock, rfcomm_sock_release could have the rfcomm
lock and hit a deadlock for acquiring the sk lock.
Here's a simplified flow:
rfcomm_sock_connect:
lock_sock(sk)
rfcomm_dlc_open:
rfcomm_lock()
rfcomm_sock_release:
rfcomm_sock_shutdown:
rfcomm_lock()
__rfcomm_dlc_close:
rfcomm_k_state_change:
lock_sock(sk)
This patch drops the sk lock before calling rfcomm_dlc_open to
avoid the possible deadlock and holds sk's reference count to
prevent use-after-free after rfcomm_dlc_open completes.
Reported-by: syzbot+d7ce59...@syzkaller.appspotmail.com
Fixes: 1804fdf6e4 ("Bluetooth: btintel: Combine setting up MSFT extension")
Link: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 [1]
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the following trace caused by attempting to lock
cmd_sync_work_lock while holding the rcu_read_lock:
kworker/u3:2/212 is trying to lock:
ffff888002600910 (&hdev->cmd_sync_work_lock){+.+.}-{3:3}, at:
hci_cmd_sync_queue+0xad/0x140
other info that might help us debug this:
context-{4:4}
4 locks held by kworker/u3:2/212:
#0: ffff8880028c6530 ((wq_completion)hci0#2){+.+.}-{0:0}, at:
process_one_work+0x4dc/0x9a0
#1: ffff888001aafde0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0},
at: process_one_work+0x4dc/0x9a0
#2: ffff888002600070 (&hdev->lock){+.+.}-{3:3}, at:
hci_cc_le_set_cig_params+0x64/0x4f0
#3: ffffffffa5994b00 (rcu_read_lock){....}-{1:2}, at:
hci_cc_le_set_cig_params+0x2f9/0x4f0
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue() failed in hci_update_adv_data(), inst_ptr is
not freed, which will cause memory leak, convert to use ERR_PTR/PTR_ERR
to pass the instance to callback so no memory needs to be allocated.
Fixes: 651cd3d65b ("Bluetooth: convert hci_update_adv_data to hci_sync")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue() failed in hci_le_terminate_big() or
hci_le_big_terminate(), the memory pointed by variable d is not freed,
which will cause memory leak. Add release process to error path.
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Don't try to use HCI_OP_LE_READ_BUFFER_SIZE_V2 if controller don't
support ISO channels, but in order to check if ISO channels are
supported HCI_OP_LE_READ_LOCAL_FEATURES needs to be done earlier so the
features bits can be checked on hci_le_read_buffer_size_sync.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216817
Fixes: c1631dbc00 ("Bluetooth: hci_sync: Fix hci_read_buffer_size_sync")
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Smatch Warning:
net/bluetooth/mgmt_util.c:375 mgmt_mesh_add() error: __memcpy()
'mesh_tx->param' too small (48 vs 50)
Analysis:
'mesh_tx->param' is array of size 48. This is the destination.
u8 param[sizeof(struct mgmt_cp_mesh_send) + 29]; // 19 + 29 = 48.
But in the caller 'mesh_send' we reject only when len > 50.
len > (MGMT_MESH_SEND_SIZE + 31) // 19 + 31 = 50.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When a connection is re-used, following can happen:
[ connection starts to close, fin sent in either direction ]
> syn # initator quickly reuses connection
< ack # peer sends a challenge ack
> rst # rst, sequence number == ack_seq of previous challenge ack
> syn # this syn is expected to pass
Problem is that the rst will fail window validation, so it gets
tagged as invalid.
If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.
Before the commit indicated in the "Fixes" tag below this used to work:
The challenge-ack made conntrack re-init state based on the challenge
ack itself, so the following rst would pass window validation.
Add challenge-ack support: If we get ack for syn, record the ack_seq,
and then check if the rst sequence number matches the last ack number
seen in reverse direction.
Fixes: c7aab4f170 ("netfilter: nf_conntrack_tcp: re-init for syn packets only")
Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Due to the two cherry picked commits from wireless to wireless-next we have
several conflicts in mt76. To avoid any bugs with conflicts merge wireless into
wireless-next.
96f134dc19 wifi: mt76: handle possible mt76_rx_token_consume failures
fe13dad899 wifi: mt76: dma: do not increment queue head if mt76_dma_add_buf fails
__inet_hash_connect() has a fast path taken if sk_head(&tb->owners) is
equal to the sk parameter.
sk_head() returns the hlist_entry() with respect to the sk_node field.
However entries in the tb->owners list are inserted with respect to the
sk_bind_node field with sk_add_bind_node().
Thus the check would never pass and the fast path never execute.
This fast path has never been executed or tested as this bug seems
to be present since commit 1da177e4c3 ("Linux-2.6.12-rc2"), thus
remove it to reduce code complexity.
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230112-inet_hash_connect_bind_head-v3-1-b591fd212b93@diag.uniroma1.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The kfree_skb_list function walks SKB (via skb->next) and frees them
individually to the SLUB/SLAB allocator (kmem_cache). It is more
efficient to bulk free them via the kmem_cache_free_bulk API.
This patches create a stack local array with SKBs to bulk free while
walking the list. Bulk array size is limited to 16 SKBs to trade off
stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
for SLUB the optimal bulk free case is 32 objects belonging to same
slab, but runtime this isn't likely to occur.
The expected gain from using kmem_cache bulk alloc and free API
have been assessed via a microbencmark kernel module[1].
The module 'slab_bulk_test01' results at bulk 16 element:
kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
kmem-bulk Per elem: 64 cycles(tsc) 17.905 ns (step:16)
More detailed description of benchmarks avail in [2].
[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
V2: rename function to kfree_skb_add_bulk.
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The SKB drop reason uses __builtin_return_address(0) to give the call
"location" to trace_kfree_skb() tracepoint skb:kfree_skb.
To keep this stable for compilers kfree_skb_reason() is annotated with
__fix_address (noinline __noclone) as fixed in commit c205cc7534
("net: skb: prevent the split of kfree_skb_reason() by gcc").
The function kfree_skb_list_reason() invoke kfree_skb_reason(), which
cause the __builtin_return_address(0) "location" to report the
unexpected address of kfree_skb_list_reason.
Example output from 'perf script':
kpktgend_0 1337 [000] 81.002597: skb:kfree_skb: skbaddr=0xffff888144824700 protocol=2048 location=kfree_skb_list_reason+0x1e reason: QDISC_DROP
Patch creates an __always_inline __kfree_skb_reason() helper call that
is called from both kfree_skb_list() and kfree_skb_list_reason().
Suggestions for solutions that shares code better are welcome.
As preparation for next patch move __kfree_skb() invocation out of
this helper function.
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This code checks if (attrs[DEVLINK_ATTR_TRAP_POLICER_ID]) twice. Once
at the start of the function and then a couple lines later. Delete the
second check since that one must be true.
Because the second condition is always true, it means the:
policer_item = group_item->policer_item;
assignment is immediately over-written. Delete that as well.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/Y8EJz8oxpMhfiPUb@kili
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will report extack message if there is an error via netlink_ack(). But
if the rule is not to be exclusively executed by the hardware, extack is not
passed along and offloading failures don't get logged.
In commit 81c7288b17 ("sched: cls: enable verbose logging") Marcelo
made cls could log verbose info for offloading failures, which helps
improving Open vSwitch debuggability when using flower offloading.
It would also be helpful if userspace monitor tools, like "tc monitor",
could log this kind of message, as it doesn't require vswitchd log level
adjusment. Let's add a new tc attributes to report the extack message so
the monitor program could receive the failures. e.g.
# tc monitor
added chain dev enp3s0f1np1 parent ffff: chain 0
added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1
ct_state +trk+new
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1
Warning: mlx5_core: matching on ct_state +new isn't supported.
In this patch I only report the extack message on add/del operations.
It doesn't look like we need to report the extack message on get/dump
operations.
Note this message not only reporte to multicast groups, it could also
be reported unicast, which may affect the current usersapce tool's behaivor.
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230113034353.2766735-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The code in l2tp_tunnel_register() is racy in several ways:
1. It modifies the tunnel socket _after_ publishing it.
2. It calls setup_udp_tunnel_sock() on an existing socket without
locking.
3. It changes sock lock class on fly, which triggers many syzbot
reports.
This patch amends all of them by moving socket initialization code
before publishing and under sock lock. As suggested by Jakub, the
l2tp lockdep class is not necessary as we can just switch to
bh_lock_sock_nested().
Fixes: 37159ef2c1 ("l2tp: fix a lockdep splat")
Fixes: 6b9f34239b ("l2tp: fix races in tunnel creation")
Reported-by: syzbot+52866e24647f9a23403f@syzkaller.appspotmail.com
Reported-by: syzbot+94cc2a66fc228b23f360@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp uses l2tp_tunnel_list to track all registered tunnels and
to allocate tunnel ID's. IDR can do the same job.
More importantly, with IDR we can hold the ID before a successful
registration so that we don't need to worry about late error
handling, it is not easy to rollback socket changes.
This is a preparation for the following fix.
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit changes virtio/vsock to use sk_buff instead of
virtio_vsock_pkt. Beyond better conforming to other net code, using
sk_buff allows vsock to use sk_buff-dependent features in the future
(such as sockmap) and improves throughput.
This patch introduces the following performance changes:
Tool: Uperf
Env: Phys Host + L1 Guest
Payload: 64k
Threads: 16
Test Runs: 10
Type: SOCK_STREAM
Before: commit b7bfaa761d ("Linux 6.2-rc3")
Before
------
g2h: 16.77Gb/s
h2g: 10.56Gb/s
After
-----
g2h: 21.04Gb/s
h2g: 10.76Gb/s
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a nasty crash [1] in net_tx_action() which
made little sense until we got a repro.
This repro installs a taprio qdisc, but providing an
invalid TCA_RATE attribute.
qdisc_create() has to destroy the just initialized
taprio qdisc, and taprio_destroy() is called.
However, the hrtimer used by taprio had already fired,
therefore advance_sched() called __netif_schedule().
Then net_tx_action was trying to use a destroyed qdisc.
We can not undo the __netif_schedule(), so we must wait
until one cpu serviced the qdisc before we can proceed.
Many thanks to Alexander Potapenko for his help.
[1]
BUG: KMSAN: uninit-value in queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
BUG: KMSAN: uninit-value in do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
BUG: KMSAN: uninit-value in __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
BUG: KMSAN: uninit-value in _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
__raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
_raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
spin_trylock include/linux/spinlock.h:359 [inline]
qdisc_run_begin include/net/sch_generic.h:187 [inline]
qdisc_run+0xee/0x540 include/net/pkt_sched.h:125
net_tx_action+0x77c/0x9a0 net/core/dev.c:5086
__do_softirq+0x1cc/0x7fb kernel/softirq.c:571
run_ksoftirqd+0x2c/0x50 kernel/softirq.c:934
smpboot_thread_fn+0x554/0x9f0 kernel/smpboot.c:164
kthread+0x31b/0x430 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
Uninit was created at:
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3258 [inline]
__kmalloc_node_track_caller+0x814/0x1250 mm/slub.c:4970
kmalloc_reserve net/core/skbuff.c:358 [inline]
__alloc_skb+0x346/0xcf0 net/core/skbuff.c:430
alloc_skb include/linux/skbuff.h:1257 [inline]
nlmsg_new include/net/netlink.h:953 [inline]
netlink_ack+0x5f3/0x12b0 net/netlink/af_netlink.c:2436
netlink_rcv_skb+0x55d/0x6c0 net/netlink/af_netlink.c:2507
rtnetlink_rcv+0x30/0x40 net/core/rtnetlink.c:6108
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xabc/0xe90 net/socket.c:2482
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2536
__sys_sendmsg net/socket.c:2565 [inline]
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 6.0.0-rc2-syzkaller-47461-gac3859c02d7f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pable Neira Ayuso says:
====================
The following patchset contains Netfilter fixes for net:
1) Increase timeout to 120 seconds for netfilter selftests to fix
nftables transaction tests, from Florian Westphal.
2) Fix overflow in bitmap_ip_create() due to integer arithmetics
in a 64-bit bitmask, from Gavrilov Ilia.
3) Fix incorrect arithmetics in nft_payload with double-tagged
vlan matching.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After switching to TCP_ESTABLISHED or TCP_LISTEN sk_state, alive SOCK_STREAM
and SOCK_SEQPACKET sockets can't change it anymore (since commit 3ff8bff704
"unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()").
Thus, we do not need to take lock here.
Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ipip6 and ip6ip decap support for bpf_skb_adjust_room().
Main use case is for using cls_bpf on ingress hook to decapsulate
IPv4 over IPv6 and IPv6 over IPv4 tunnel packets.
Add two new flags BPF_F_ADJ_ROOM_DECAP_L3_IPV{4,6} to indicate the
new IP header version after decapsulating the outer IP header.
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/b268ec7f0ff9431f4f43b1b40ab856ebb28cb4e1.1673574419.git.william.xuanziyang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Peek at old qdisc and graft only when deleting a leaf class in the htb,
rather than when deleting the htb itself. Do not peek at the qdisc of the
netdev queue when destroying the htb. The caller may already have grafted a
new qdisc that is not part of the htb structure being destroyed.
This fix resolves two use cases.
1. Using tc to destroy the htb.
- Netdev was being prematurely activated before the htb was fully
destroyed.
2. Using tc to replace the htb with another qdisc (which also leads to
the htb being destroyed).
- Premature netdev activation like previous case. Newly grafted qdisc
was also getting accidentally overwritten when destroying the htb.
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230113005528.302625-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an MPTCP socket has been created with AF_INET6 and the IPV6_V6ONLY
option has been set, the userspace PM would allow creating subflows
using IPv4 addresses, e.g. mapped in v6.
The kernel side of userspace PM will also accept creating subflows with
local and remote addresses having different families. Depending on the
subflow socket's family, different behaviours are expected:
- If AF_INET is forced with a v6 address, the kernel will take the last
byte of the IP and try to connect to that: a new subflow is created
but to a non expected address.
- If AF_INET6 is forced with a v4 address, the kernel will try to
connect to a v4 address (v4-mapped-v6). A -EBADF error from the
connect() part is then expected.
It is then required to check the given families can be accepted. This is
done by using a new helper for addresses family matching, taking care of
IPv4 vs IPv4-mapped-IPv6 addresses. This helper will be re-used later by
the in-kernel path-manager to use mixed IPv4 and IPv6 addresses.
While at it, a clear error message is now reported if there are some
conflicts with the families that have been passed by the userspace.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let the caller specify the to-be-created subflow family.
For a given MPTCP socket created with the AF_INET6 family, the current
userspace PM can already ask the kernel to create subflows in v4 and v6.
If "plain" IPv4 addresses are passed to the kernel, they are
automatically mapped in v6 addresses "by accident". This can be
problematic because the userspace will need to pass different addresses,
now the v4-mapped-v6 addresses to destroy this new subflow.
On the other hand, if the MPTCP socket has been created with the AF_INET
family, the command to create a subflow in v6 will be accepted but the
result will not be the one as expected as new subflow will be created in
IPv4 using part of the v6 addresses passed to the kernel: not creating
the expected subflow then.
No functional change intended for the in-kernel PM where an explicit
enforcement is currently in place. This arbitrary enforcement will be
leveraged by other patches in a future version.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Cc: stable@vger.kernel.org
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Revert a wrong fix that was done during the review process. The
intention was to substitute "if(ret < 0)" with "if(ret)".
Unfortunately, the intended fix did not meet the code.
Besides, after additional review, it was decided that "if(ret < 0)"
was actually the right thing to do.
Fixes: 8580e16c28 ("net/ethtool: add netlink interface for the PLCA RS")
Signed-off-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/f2277af8951a51cfee2fb905af8d7a812b7beaf4.1673616357.git.piergiorgio.beruto@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a use-after-free that occurs in kfree_skb() called from
local_cleanup(). This could happen when killing nfc daemon (e.g. neard)
after detaching an nfc device.
When detaching an nfc device, local_cleanup() called from
nfc_llcp_unregister_device() frees local->rx_pending and decreases
local->ref by kref_put() in nfc_llcp_local_put().
In the terminating process, nfc daemon releases all sockets and it leads
to decreasing local->ref. After the last release of local->ref,
local_cleanup() called from local_release() frees local->rx_pending
again, which leads to the bug.
Setting local->rx_pending to NULL in local_cleanup() could prevent
use-after-free when local_cleanup() is called twice.
Found by a modified version of syzkaller.
BUG: KASAN: use-after-free in kfree_skb()
Call Trace:
dump_stack_lvl (lib/dump_stack.c:106)
print_address_description.constprop.0.cold (mm/kasan/report.c:306)
kasan_check_range (mm/kasan/generic.c:189)
kfree_skb (net/core/skbuff.c:955)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_local_put.part.0 (net/nfc/llcp_core.c:172)
nfc_llcp_local_put (net/nfc/llcp_core.c:181)
llcp_sock_destruct (net/nfc/llcp_sock.c:959)
__sk_destruct (net/core/sock.c:2133)
sk_destruct (net/core/sock.c:2181)
__sk_free (net/core/sock.c:2192)
sk_free (net/core/sock.c:2203)
llcp_sock_release (net/nfc/llcp_sock.c:646)
__sock_release (net/socket.c:650)
sock_close (net/socket.c:1365)
__fput (fs/file_table.c:306)
task_work_run (kernel/task_work.c:179)
ptrace_notify (kernel/signal.c:2354)
syscall_exit_to_user_mode_prepare (kernel/entry/common.c:278)
syscall_exit_to_user_mode (kernel/entry/common.c:296)
do_syscall_64 (arch/x86/entry/common.c:86)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:106)
Allocated by task 4719:
kasan_save_stack (mm/kasan/common.c:45)
__kasan_slab_alloc (mm/kasan/common.c:325)
slab_post_alloc_hook (mm/slab.h:766)
kmem_cache_alloc_node (mm/slub.c:3497)
__alloc_skb (net/core/skbuff.c:552)
pn533_recv_response (drivers/nfc/pn533/usb.c:65)
__usb_hcd_giveback_urb (drivers/usb/core/hcd.c:1671)
usb_giveback_urb_bh (drivers/usb/core/hcd.c:1704)
tasklet_action_common.isra.0 (kernel/softirq.c:797)
__do_softirq (kernel/softirq.c:571)
Freed by task 1901:
kasan_save_stack (mm/kasan/common.c:45)
kasan_set_track (mm/kasan/common.c:52)
kasan_save_free_info (mm/kasan/genericdd.c:518)
__kasan_slab_free (mm/kasan/common.c:236)
kmem_cache_free (mm/slub.c:3809)
kfree_skbmem (net/core/skbuff.c:874)
kfree_skb (net/core/skbuff.c:931)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_unregister_device (net/nfc/llcp_core.c:1617)
nfc_unregister_device (net/nfc/core.c:1179)
pn53x_unregister_nfc (drivers/nfc/pn533/pn533.c:2846)
pn533_usb_disconnect (drivers/nfc/pn533/usb.c:579)
usb_unbind_interface (drivers/usb/core/driver.c:458)
device_release_driver_internal (drivers/base/dd.c:1279)
bus_remove_device (drivers/base/bus.c:529)
device_del (drivers/base/core.c:3665)
usb_disable_device (drivers/usb/core/message.c:1420)
usb_disconnect (drivers/usb/core.c:2261)
hub_event (drivers/usb/core/hub.c:5833)
process_one_work (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/workqueue.h:108 kernel/workqueue.c:2281)
worker_thread (include/linux/list.h:282 kernel/workqueue.c:2423)
kthread (kernel/kthread.c:319)
ret_from_fork (arch/x86/entry/entry_64.S:301)
Fixes: 3536da06db ("NFC: llcp: Clean local timers and works when removing a device")
Signed-off-by: Jisoo Jang <jisoo.jang@yonsei.ac.kr>
Link: https://lore.kernel.org/r/20230111131914.3338838-1-jisoo.jang@yonsei.ac.kr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The details of the iov_iter types are appropriately abstracted, so
there's no need to check for specific type fields. Just let the
abstractions handle it.
This is preparing for io_uring/net's io_send to utilize the more
efficient ITER_UBUF.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20230111184245.3784393-1-kbusch@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add 2 tracepoints to monitor the tcp/udp traffic
of per process and per cgroup.
Regarding monitoring the tcp/udp traffic of each process, there are two
existing solutions, the first one is https://www.atoptool.nl/netatop.php.
The second is via kprobe/kretprobe.
Netatop solution is implemented by registering the hook function at the
hook point provided by the netfilter framework.
These hook functions may be in the soft interrupt context and cannot
directly obtain the pid. Some data structures are added to bind packets
and processes. For example, struct taskinfobucket, struct taskinfo ...
Every time the process sends and receives packets it needs multiple
hashmaps,resulting in low performance and it has the problem fo inaccurate
tcp/udp traffic statistics(for example: multiple threads share sockets).
We can obtain the information with kretprobe, but as we know, kprobe gets
the result by trappig in an exception, which loses performance compared
to tracepoint.
We compared the performance of tracepoints with the above two methods, and
the results are as follows:
ab -n 1000000 -c 1000 -r http://127.0.0.1/index.html
without trace:
Time per request: 39.660 [ms] (mean)
Time per request: 0.040 [ms] (mean, across all concurrent requests)
netatop:
Time per request: 50.717 [ms] (mean)
Time per request: 0.051 [ms] (mean, across all concurrent requests)
kr:
Time per request: 43.168 [ms] (mean)
Time per request: 0.043 [ms] (mean, across all concurrent requests)
tracepoint:
Time per request: 41.004 [ms] (mean)
Time per request: 0.041 [ms] (mean, across all concurrent requests
It can be seen that tracepoint has better performance.
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Xiongchun Duan <duanxiongchun@bytedance.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following ethtool tx aggregation parameters:
ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES
Maximum size in bytes of a tx aggregated block of frames.
ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
Maximum number of frames that can be aggregated into a block.
ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS
Time in usecs after the first packet arrival in an aggregated
block for the block to be sent.
Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For PDelay_Resp messages we will likely have a negative value in the
correction field. The switch hardware cannot correctly update such
values (produces an off by one error in the UDP checksum), so it must be
moved to the time stamp field in the tail tag. Format of the correction
field is 48 bit ns + 16 bit fractional ns. After updating the
correction field, clone is no longer required hence it is freed.
Signed-off-by: Christian Eggers <ceggers@arri.de>
Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the routines for transmission of ptp packets. When the
ptp pdelay_req packet to be transmitted, it uses the deferred xmit
worker to schedule the packets.
During irq_setup, interrupt for Sync, Pdelay_req and Pdelay_rsp are
enabled. So interrupt is triggered for all three packets. But for
p2p1step, we require only time stamp of Pdelay_req packet. Hence to
avoid posting of the completion from ISR routine for Sync and
Pdelay_resp packets, ts_en flag is introduced. This controls which
packets need to processed for timestamp.
After the packet is transmitted, ISR is triggered. The time at which
packet transmitted is recorded to separate register.
This value is reconstructed to absolute time and posted to the user
application through socket error queue.
Signed-off-by: Christian Eggers <ceggers@arri.de>
Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rx Timestamping is done through 4 additional bytes in tail tag.
Whenever the ptp packet is received, the 4 byte hardware time stamped
value is added before 1 byte tail tag. Also, bit 7 in tail tag indicates
it as PTP frame. This 4 byte value is extracted from the tail tag and
reconstructed to absolute time and assigned to skb hwtstamp.
If the packet received in PDelay_Resp, then partial ingress timestamp
is subtracted from the correction field. Since user space tools expects
to be done in hardware.
Signed-off-by: Christian Eggers <ceggers@arri.de>
Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the PTP is enabled in hardware bit 6 of PTP_MSG_CONF1 register, the
transmit frame needs additional 4 bytes before the tail tag. It is
needed for all the transmission packets irrespective of PTP packets or
not.
The 4-byte timestamp field is 0 for frames other than Pdelay_Resp. For
the one-step Pdelay_Resp, the switch needs the receive timestamp of the
Pdelay_Req message so that it can put the turnaround time in the
correction field.
Since PTP has to be enabled for both Transmission and reception
timestamping, driver needs to track of the tx and rx setting of the all
the user ports in the switch.
Two flags hw_tx_en and hw_rx_en are added in ksz_port to track the
timestampping setting of each port. When any one of ports has tx or rx
timestampping enabled, bit 6 of PTP_MSG_CONF1 is set and it is indicated
to tag_ksz.c through tagger bytes. This flag adds 4 additional bytes to
the tail tag. When tx and rx timestamping of all the ports are disabled,
then 4 bytes are not added.
Tested using hwstamp -i <interface>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com> # mostly api
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code for RSS_GET ethtool command includes netlink attributes
in reply message to user space even if they are null. Added checks
to include netlink attribute in reply message only if a value is
received from driver. Drivers might return null for RSS indirection
table or hash key. Instead of including attributes with empty value
in the reply message, add netlink attribute only if there is content.
Fixes: 7112a04664 ("ethtool: add netlink based get rss support")
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20230111235607.85509-1-sudheer.mogilappagari@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix rxrpc_connect_call() to return -ENOMEM rather than 0 if it fails to
look up a peer.
This generated a smatch warning:
net/rxrpc/call_object.c:303 rxrpc_connect_call() warn: missing error code 'ret'
I think this also fixes a syzbot-found bug:
rxrpc: Assertion failed - 1(0x1) == 11(0xb) is false
------------[ cut here ]------------
kernel BUG at net/rxrpc/call_object.c:645!
where the call being put is in the wrong state - as would be the case if we
failed to clear up correctly after the error in rxrpc_connect_call().
Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-and-tested-by: syzbot+4bb6356bb29d6299360e@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/202301111153.9eZRYLf1-lkp@intel.com/
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/2438405.1673460435@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>