Commit Graph

1267966 Commits

Author SHA1 Message Date
Eric Dumazet
9cf621bd5f rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protection
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.

All rtnl_link_ops->get_link_net() methods already using dev_net()
are ready. I added READ_ONCE() annotations on others.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
979aad40da rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()
dev->xdp_prog is protected by RCU, we can lift RTNL requirement
from rtnl_xdp_prog_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
6890ab31d1 rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()
Change dev_change_proto_down() and dev_change_proto_down_reason()
to write once on dev->proto_down and dev->proto_down_reason.

Then rtnl_fill_proto_down() can use READ_ONCE() annotations
and run locklessly.

rtnl_proto_down_size() should assume worst case,
because reading dev->proto_down_reason multiple
times would be racy without RTNL in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
6747a5d499 rtnetlink: do not depend on RTNL for many attributes
The following device fields can be read locklessly
in rtnl_fill_ifinfo():

type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group,
promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size,
gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size,
tso_max_segs, num_rx_queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
55a2c86c8d net: write once on dev->allmulti and dev->promiscuity
In the following patch we want to read dev->allmulti
and dev->promiscuity locklessly from rtnl_fill_ifinfo()

In this patch I change __dev_set_promiscuity() and
__dev_set_allmulti() to write these fields (and dev->flags)
only if they succeed, with WRITE_ONCE() annotations.
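
The write-once pattern described above can be sketched in userspace C. This is only an illustration: `struct fake_dev`, `fake_set_promiscuity()` and `fake_fill_promiscuity()` are hypothetical names, the overflow checks are simplified, and the two macros are volatile-cast stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), not the kernel definitions.

```c
#include <errno.h>

/* Userspace stand-ins for the kernel macros: the volatile access asks the
 * compiler for exactly one load or one store of the object. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, trimmed-down stand-in for struct net_device. */
struct fake_dev {
	unsigned int promiscuity;
};

/* Writer side (would run under RTNL): compute the new value in a local,
 * validate it, and publish it with a single store only on success. */
int fake_set_promiscuity(struct fake_dev *dev, int inc)
{
	unsigned int cur = dev->promiscuity;
	unsigned int next = cur + (unsigned int)inc;

	if (inc > 0 && next < cur)	/* counter would wrap around */
		return -EOVERFLOW;
	if (inc < 0 && next > cur)	/* counter would drop below zero */
		return -EINVAL;

	WRITE_ONCE(dev->promiscuity, next);
	return 0;
}

/* Reader side (no RTNL): one annotated load, so the reader sees either the
 * old or the new value, never an intermediate one. */
unsigned int fake_fill_promiscuity(const struct fake_dev *dev)
{
	return READ_ONCE(dev->promiscuity);
}
```

Because the field is written only after validation, a lockless reader never observes a value that a failed update would later have to undo.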

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
ad13b5b0d1 rtnetlink: do not depend on RTNL for IFLA_TXQLEN output
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations.

Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
8a58268133 rtnetlink: do not depend on RTNL for IFLA_IFNAME output
We can use netdev_copy_name() to no longer rely on RTNL
to fetch dev->name.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet
698419ffb6 rtnetlink: do not depend on RTNL for IFLA_QDISC output
dev->qdisc can be read using RCU protection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Paolo Abeni
25010156d2 Merge branch 'net-qede-don-t-restrict-error-codes'
Asbjørn Sloth Tønnesen says:

====================
net: qede: don't restrict error codes

This series fixes the qede driver, so that when a helper function fails,
the caller returns the helper's error code, instead of just
assuming that the error is e.g. -EINVAL.

The patches in this series reduce the chance of future bugs, so new
error codes can be returned from the helpers, without having to update
the call sites.

This is a follow-up to my recent series "net: qede: avoid overruling
error codes", which fixed the cases where the implicit assumption of
failing with specific error codes had been broken.
https://lore.kernel.org/netdev/20240426091227.78060-1-ast@fiberby.net/

Asbjørn Sloth Tønnesen (3):
  net: qede: use return from qede_parse_actions() for flow_spec
  net: qede: use return from qede_flow_spec_validate_unused()
  net: qede: use return from qede_flow_parse_ports()

 .../net/ethernet/qlogic/qede/qede_filter.c    | 27 ++++++++++++-------
 1 file changed, 18 insertions(+), 9 deletions(-)
====================

Link: https://lore.kernel.org/r/20240503105505.839342-1-ast@fiberby.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:08:16 +02:00
Asbjørn Sloth Tønnesen
c0c66eba63 net: qede: use return from qede_flow_parse_ports()
When calling qede_flow_parse_ports(), then the
return code was only used for a non-zero check,
and then -EINVAL was returned.

qede_flow_parse_ports() can currently fail with:
* -EINVAL

This patch changes qede_flow_parse_v{4,6}_common() to
use the actual return code from qede_flow_parse_ports(),
so it's no longer assumed that all errors are -EINVAL.
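
The difference between the two call-site patterns can be sketched as follows. The helper and both callers are hypothetical stand-ins (the real functions take flow rules, not an error code); the sketch only shows how flattening to -EINVAL loses information that returning the helper's code preserves.

```c
#include <errno.h>

/* Hypothetical helper standing in for qede_flow_parse_ports(): it simply
 * fails with whatever error code the caller of this sketch asks for. */
static int fake_parse_ports(int simulated_err)
{
	return simulated_err;
}

/* Old pattern: the return code is only tested for non-zero, and every
 * failure is reported as -EINVAL. */
int fake_parse_common_old(int simulated_err)
{
	if (fake_parse_ports(simulated_err))
		return -EINVAL;
	return 0;
}

/* New pattern: the helper's own return code is passed up unchanged. */
int fake_parse_common_new(int simulated_err)
{
	int rc = fake_parse_ports(simulated_err);

	if (rc)
		return rc;
	return 0;
}
```

With the new pattern, a helper that later starts returning, say, -EOPNOTSUPP needs no change at the call site.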

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:08:14 +02:00
Asbjørn Sloth Tønnesen
e5ed2f0349 net: qede: use return from qede_flow_spec_validate_unused()
When calling qede_flow_spec_validate_unused() then
the return code was only used for a non-zero check,
and then -EOPNOTSUPP was returned.

qede_flow_spec_validate_unused() can currently fail with:
* -EOPNOTSUPP

This patch changes qede_flow_spec_to_rule() to use the
actual return code from qede_flow_spec_validate_unused(),
so it's no longer assumed that all errors are -EOPNOTSUPP.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:08:14 +02:00
Asbjørn Sloth Tønnesen
146817ec32 net: qede: use return from qede_parse_actions() for flow_spec
In qede_flow_spec_to_rule(), when calling
qede_parse_actions() then the return code
was only used for a non-zero check, and then
-EINVAL was returned.

qede_parse_actions() can currently fail with:
* -EINVAL
* -EOPNOTSUPP

Commit 319a1d1947 ("flow_offload: check for
basic action hw stats type") broke the implicit
assumption that it could only fail with -EINVAL,
by changing it to return -EOPNOTSUPP, when hardware
stats are requested.

However AFAICT it's not possible to trigger
qede_parse_actions() to return -EOPNOTSUPP, when
called from qede_flow_spec_to_rule(), as hardware
stats can't be requested by ethtool_rx_flow_rule_create().

This patch changes qede_flow_spec_to_rule() to use
the actual return code from qede_parse_actions(),
so it's no longer assumed that all errors are -EINVAL.

Only compile tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:08:14 +02:00
Jakub Kicinski
179a6f5df8 ipsec-next-2024-05-03
Merge tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2024-05-03

1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support.
   This was defined by an early version of an IETF draft
   that did not make it to a standard.

2) Introduce direction attribute for xfrm states.
   xfrm states have a direction: a state can be used
   either for input or output packet processing.
   Add a direction to xfrm states to make it clear
   what an xfrm state is used for.

* tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: Restrict SA direction attribute to specific netlink message types
  xfrm: Add dir validation to "in" data path lookup
  xfrm: Add dir validation to "out" data path lookup
  xfrm: Add Direction to the SA in or out
  udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support
====================

Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06 19:14:56 -07:00
Shi-Sheng Yang
46a5d3abed mptcp: fix typos in comments
This patch fixes the spelling mistakes in comments.
The changes were generated using codespell and reviewed manually.

eariler -> earlier
greceful -> graceful

Signed-off-by: Shi-Sheng Yang <fourcolor4c@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240502154740.249839-1-fourcolor4c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06 19:12:08 -07:00
Simon Horman
6bee694225 octeontx2-pf: Treat truncation of IRQ name as an error
According to GCC, the construction of irq_name in otx2_open()
may, theoretically, be truncated.

This patch takes the approach of treating such a situation as an error
which it detects by making use of the return value of snprintf, which is
the total number of bytes, excluding the trailing '\0', that would have
been written.
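
The truncation check can be sketched in standalone C. `fake_build_irq_name()` is a hypothetical helper, and returning -EINVAL on truncation is an assumption of this sketch; only the format string and NAME_SIZE of 32 are taken from the compiler output quoted in this message.

```c
#include <errno.h>
#include <stdio.h>

#define NAME_SIZE 32	/* destination size reported in the GCC note */

/* Hypothetical helper: formats the IRQ name and treats truncation as an
 * error. snprintf() returns the number of bytes that would have been
 * written, excluding the trailing '\0', so any value >= NAME_SIZE means
 * the output did not fit. */
int fake_build_irq_name(char irq_name[NAME_SIZE], const char *ifname, int qidx)
{
	int len = snprintf(irq_name, NAME_SIZE, "%s-rxtx-%d", ifname, qidx);

	if (len < 0)		/* encoding error */
		return -EINVAL;
	if (len >= NAME_SIZE)	/* name was truncated */
		return -EINVAL;
	return 0;
}
```

Checking the return value this way keeps the buffer size unchanged while still making the failure visible to the caller.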

Based on the approach taken to a similar problem in
commit 54b909436e ("rtc: fix snprintf() checking in is_rtc_hctosys()")

Flagged by gcc-13 W=1 builds as:

.../otx2_pf.c:1933:58: warning: 'snprintf' output may be truncated before the last format character [-Wformat-truncation=]
 1933 |                 snprintf(irq_name, NAME_SIZE, "%s-rxtx-%d", pf->netdev->name,
      |                                                          ^
.../otx2_pf.c:1933:17: note: 'snprintf' output between 8 and 33 bytes into a destination of size 32
 1933 |                 snprintf(irq_name, NAME_SIZE, "%s-rxtx-%d", pf->netdev->name,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1934 |                          qidx);
      |                          ~~~~~

Compile tested only.

Tested-by: Geetha sowjanya <gakula@marvell.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240503-octeon2-pf-irq_name-truncation-v2-1-91099177b942@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06 19:07:34 -07:00
Dr. David Alan Gilbert
ad3c9f0e62 atm/fore200e: Delete unused 'fore200e_boards'
This list looks like it's been unused since the OF conversion in
2008 in

commit 826b6cfcd5 ("fore200e: Convert over to pure OF driver.")

This also means we can remove the 'entry' member for the list.

Build tested only.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240503001822.183061-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06 18:26:47 -07:00
Shailend Chand
c93462b914 gve: Implement queue api
The new netdev queue api is implemented for gve.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Link: https://lore.kernel.org/all/20240501232549.1327174-11-shailend@google.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06 18:23:05 -07:00
Paolo Abeni
8c4e479812 Merge branch 'add-tcp-fraglist-gro-support'
Felix Fietkau says:

====================
Add TCP fraglist GRO support

When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle

Changes since v4:
 - add likely() to prefer the non-fraglist path in check

Changes since v3:
 - optimize __tcpv4_gso_segment_csum
 - add unlikely()
 - reorder dev_net/skb_gro_network_header calls after NETIF_F_GRO_FRAGLIST
   check
 - add support for ipv6 nat
 - drop redundant pskb_may_pull check

Changes since v2:
 - create tcp_gro_header_pull helper function to pull tcp header only once
 - optimize __tcpv4_gso_segment_list_csum, drop obsolete flags check

Changes since v1:
 - revert bogus tcp flags overwrite on segmentation
 - fix kbuild issue with !CONFIG_IPV6
 - only perform socket lookup for the first skb in the GRO train

Changes since RFC:
 - split up patches
 - handle TCP flags mutations
====================

Link: https://lore.kernel.org/r/20240502084450.44009-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:08 +02:00
Felix Fietkau
c9d1d23e52 net: add heuristic for enabling TCP fraglist GRO
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau
7516b27c55 net: create tcp_gro_header_pull helper function
Pull the code out of tcp_gro_receive in order to access the tcp header
from tcp4/6_gro_receive.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau
80e85fbdf1 net: create tcp_gro_lookup helper function
This pulls the flow port matching out of tcp_gro_receive, so that it can be
reused for the next change, which adds the TCP fraglist GRO heuristic.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau
8d95dc474f net: add code for TCP fraglist GRO
This implements fraglist GRO similar to how it's handled in UDP, however
no functional changes are added yet. The next change adds a heuristic for
using fraglist GRO instead of regular GRO.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:04 +02:00
Felix Fietkau
bee88cd5bd net: add support for segmenting TCP fraglist GSO packets
Preparation for adding TCP fraglist GRO support. It expects packets to be
combined in a similar way as UDP fraglist GSO packets.
For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:03 +02:00
Felix Fietkau
8928756d53 net: move skb_gro_receive_list from udp to core
This helper function will be used for TCP fraglist GRO support

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:03 +02:00
Rengarajan S
b1de3c0df7 net: microchip: lan743x: Reduce PTP timeout on HW failure
The PTP_CMD_CTL is a self-clearing register which controls the PTP clock
values. In the current implementation the driver waits up to 20 sec for
the PTP_CMD_CTL register bit to clear in case of HW failure. A timeout
of 20 sec is far too long to recognize a HW failure, as the bit is
typically cleared in one clock (<16 ns). Hence reducing the timeout to 1 sec
is sufficient to conclude whether a HW failure has occurred. The
usleep_range will sleep somewhere between 1 msec and 20 msec for each
iteration. By setting the PTP_CMD_CTL_TIMEOUT_CNT to 50 the max timeout
becomes 1 sec (50 iterations of at most 20 msec each).
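
The timeout arithmetic can be checked with a small model. Everything prefixed `fake_` is hypothetical, and the loop shape is an assumption rather than the driver's actual code; only the iteration count of 50 and the 1 msec to 20 msec sleep range come from this message.

```c
#include <errno.h>

#define PTP_CMD_CTL_TIMEOUT_CNT 50	/* value named in the commit message */
#define FAKE_SLEEP_MIN_US	1000	/* usleep_range() lower bound, 1 msec */
#define FAKE_SLEEP_MAX_US	20000	/* usleep_range() upper bound, 20 msec */

/* Hypothetical model of the self-clearing bit: it reads as set for
 * busy_polls reads and as clear afterwards. */
struct fake_ptp_hw {
	int busy_polls;
};

static int fake_cmd_bit_is_set(struct fake_ptp_hw *hw)
{
	if (hw->busy_polls > 0) {
		hw->busy_polls--;
		return 1;
	}
	return 0;
}

/* Bounded poll: returns 0 once the bit clears, or -ETIMEDOUT after
 * PTP_CMD_CTL_TIMEOUT_CNT attempts. */
int fake_ptp_wait_till_cmd_done(struct fake_ptp_hw *hw)
{
	int i;

	for (i = 0; i < PTP_CMD_CTL_TIMEOUT_CNT; i++) {
		if (!fake_cmd_bit_is_set(hw))
			return 0;
		/* a driver would sleep here, e.g. usleep_range(1000, 20000) */
	}
	return -ETIMEDOUT;
}

/* Worst case: every iteration sleeps for the upper bound. */
unsigned long fake_ptp_worst_case_wait_us(void)
{
	return (unsigned long)PTP_CMD_CTL_TIMEOUT_CNT * FAKE_SLEEP_MAX_US;
}
```

Under these assumptions the wait is bounded by 50 x 20 msec = 1 sec, and by 50 x 1 msec = 50 msec if every sleep hits the lower bound.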

Signed-off-by: Rengarajan S <rengarajan.s@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240502050300.38689-1-rengarajan.s@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:41:32 +02:00
David S. Miller
cdc74c9d06 Merge branch 'gve-queue-api'
Shailend Chand says:

====================
gve: Implement queue api

Following the discussion on
https://patchwork.kernel.org/project/linux-media/patch/20240305020153.2787423-2-almasrymina@google.com/,
the queue api defined by Mina is implemented for gve.

The first patch is just Mina's introduction of the api. The rest of the
patches make surgical changes in gve to enable it to work correctly with
only a subset of queues present (thus far it had assumed that either all
queues are up or all are down). The final patch has the api
implementation.

Changes since v1: clang warning fixes, kdoc warning fix, and addressed
review comments.
====================

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:48 +01:00
Shailend Chand
ee24284e2a gve: Alloc and free QPLs with the rings
Every tx and rx ring has its own queue-page-list (QPL) that serves as
the bounce buffer. Previously we were allocating QPLs for all queues
before the queues themselves were allocated and later associating a QPL
with a queue. This is avoidable complexity: it is much more natural for
each queue to allocate and free its own QPL.

Moreover, the advent of new queue-manipulating ndo hooks makes it hard to
keep things as is: we would need to transfer a QPL from an old queue to
a new queue, and that is unpleasant.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
af9bcf910b gve: Account for stopped queues when reading NIC stats
We now account for the fact that the NIC might send us stats for a
subset of queues. Without this change, gve_get_ethtool_stats might make
an invalid access on the priv->stats_report->stats array.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
770f52d5a0 gve: Reset Rx ring state in the ring-stop funcs
This does not fix any existing bug. In anticipation of the ndo queue api
hooks that alloc/free/start/stop a single Rx queue, the already existing
per-queue stop functions are being made more robust. Specifically for
this use case: rx_queue_n.stop() + rx_queue_n.start()

Note that this is not the use case being used in devmem tcp (the first
place these new ndo hooks would be used). There the usecase is:
new_queue.alloc() + old_queue.stop() + new_queue.start() + old_queue.free()

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
9a5e0776d1 gve: Avoid rescheduling napi if on wrong cpu
In order to make possible the implementation of per-queue ndo hooks,
gve_turnup was changed in a previous patch to account for queues already
having some unprocessed descriptors: it does a one-off napi_schedule to
handle them. If conditions of consistent high traffic persist in the
immediate aftermath of this, the poll routine for a queue can be "stuck"
on the cpu on which the ndo hooks ran, instead of the cpu its irq has
affinity with.

This situation is exacerbated by the fact that the ndo hooks for all the
queues are invoked on the same cpu, potentially causing all the napi
poll routines to be residing on the same cpu.

A self correcting mechanism in the poll method itself solves this
problem.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
864616d97a gve: Make gve_turnup work for nonempty queues
gVNIC has a requirement that all queues have to be quiesced before any
queue is operated on (created or destroyed). To enable the
implementation of future ndo hooks that work on a single queue, we need
to evolve gve_turnup to account for queues already having some
unprocessed descriptors in the ring.

Say rxq 4 is being stopped and started via the queue api. Due to gve's
requirement of quiescence, queues 0 through 3 are not processing their
rings while queue 4 is being toggled. Once they are made live, these
queues need to be poked to cause them to check their rings for
descriptors that were written during their brief period of quiescence.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
5abc37bdcb gve: Make gve_turn(up|down) ignore stopped queues
Currently the queues are either all live or all dead, toggling from one
state to the other via the ndo open and stop hooks. The future addition
of single-queue ndo hooks changes this, and thus gve_turnup and
gve_turndown should evolve to account for a state where some queues are
live and some aren't.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
242f30fe69 gve: Add adminq funcs to add/remove a single Rx queue
This allows for implementing future ndo hooks that act on a single
queue.

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:34 +01:00
Shailend Chand
dcecfcf21b gve: Make the GQ RX free queue funcs idempotent
Although this is not fixing any existing double free bug, making these
functions idempotent allows for a simpler implementation of future ndo
hooks that act on a single queue.
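
What "idempotent" means for a free function can be sketched as below. `struct fake_rx_ring` and both functions are hypothetical and use plain heap memory; they are not the gve data structures.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, minimal stand-in for an Rx ring. */
struct fake_rx_ring {
	unsigned long *desc;	/* descriptor storage, NULL when not allocated */
	size_t slots;
};

int fake_rx_ring_alloc(struct fake_rx_ring *ring, size_t slots)
{
	ring->desc = calloc(slots, sizeof(*ring->desc));
	if (!ring->desc)
		return -ENOMEM;
	ring->slots = slots;
	return 0;
}

/* Idempotent free: releasing an already-freed (or never-allocated) ring is
 * a no-op, because the pointer is reset to NULL after the first free. */
void fake_rx_ring_free(struct fake_rx_ring *ring)
{
	if (!ring->desc)
		return;
	free(ring->desc);
	ring->desc = NULL;
	ring->slots = 0;
}
```

A caller that tears down a single queue can then call the free function unconditionally, without tracking whether an earlier error path already released the ring.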

Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:33 +01:00
Mina Almasry
087b24de5c queue_api: define queue api
This API enables the net stack to reset the queues used for devmem TCP.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-05 14:35:33 +01:00
Mina Almasry
173e7622cc Revert "net: mirror skb frag ref/unref helpers"
This reverts commit a580ea994f.

This revert is to resolve Dragos's report of page_pool leak here:
https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/

The reverted patch interacts very badly with commit 2cc3aeb5ec ("skbuff:
Fix a potential race while recycling page_pool packets"). The reverted
commit hopes that the pp_recycle + is_pp_page variables do not change
between the skb_frag_ref and skb_frag_unref operation. If such a change
occurs, the skb_frag_ref/unref will not operate on the same reference type.
In the case of Dragos's report, the grabbed ref was a pp ref, but the unref
was a page ref, because the pp_recycle setting on the skb was changed.
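
The mismatch can be illustrated with a toy model. The structures and counters below are hypothetical and far simpler than the real skb and page_pool code; they only show what happens when the flag that selects the reference type changes between the ref and the unref.

```c
/* Hypothetical model: one fragment with two independent reference counts. */
struct fake_frag {
	int pp_refs;	/* page_pool style references */
	int page_refs;	/* plain page references */
};

/* Hypothetical model of the skb flag that selects the reference type. */
struct fake_skb {
	int pp_recycle;
};

/* Take a reference of the type selected by the skb's current flag. */
void fake_frag_ref(struct fake_frag *frag, const struct fake_skb *skb)
{
	if (skb->pp_recycle)
		frag->pp_refs++;
	else
		frag->page_refs++;
}

/* Drop a reference of the type selected by the skb's current flag. */
void fake_frag_unref(struct fake_frag *frag, const struct fake_skb *skb)
{
	if (skb->pp_recycle)
		frag->pp_refs--;
	else
		frag->page_refs--;
}
```

If `pp_recycle` is 1 at ref time and 0 at unref time, the model ends with `pp_refs == 1` (a leaked pp reference) and `page_refs == -1` (an unbalanced page reference).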

Attempting to fix this issue on the fly is risky. Let's revert and I hope
to reland this with better understanding and testing to ensure we don't
regress some edge case while streamlining skb reffing.

Fixes: a580ea994f ("net: mirror skb frag ref/unref helpers")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 16:05:53 -07:00
David Wei
5bfadc5737 bnxt: fix bnxt_get_avail_msix() returning negative values
Current net-next/main does not boot for older chipsets e.g. Stratus.

Sample dmesg:
[   11.368315] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Able to reserve only 0 out of 9 requested RX rings
[   11.390181] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Unable to reserve tx rings
[   11.438780] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): 2nd rings reservation failed.
[   11.487559] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Not enough rings available.
[   11.506012] bnxt_en 0000:02:00.0: probe with driver bnxt_en failed with error -12

This is caused by bnxt_get_avail_msix() returning a negative value for
these chipsets not using the new resource manager i.e. !BNXT_NEW_RM.
This in turn causes hwr.cp in __bnxt_reserve_rings() to be set to 0.

In the current call stack, __bnxt_reserve_rings() is called from
bnxt_set_dflt_rings() before bnxt_init_int_mode(). Therefore,
bp->total_irqs is always 0 and for !BNXT_NEW_RM bnxt_get_avail_msix()
always returns a negative number.

Historically, MSIX vectors were requested by the RoCE driver during
run-time and bnxt_get_avail_msix() was used for this purpose. Today,
RoCE MSIX vectors are statically allocated. bnxt_get_avail_msix() should
only be called for the BNXT_NEW_RM() case to reserve the MSIX ahead of
time for RoCE use.

bnxt_get_avail_msix() is also simplified to handle the BNXT_NEW_RM()
case only.

Fixes: d630624ebd ("bnxt_en: Utilize ulp client resources if RoCE is not registered")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240502203757.3761827-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 16:04:04 -07:00
Eric Dumazet
c1742dcb6b net: no longer acquire RTNL in threaded_show()
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:14:01 -07:00
Jakub Kicinski
3e51f2cbbc tools: ynl: add --list-ops and --list-msgs to CLI
I often forget the exact naming of ops and have to look at
the spec to find it. Add support for listing the operations:

  $ ./cli.py --spec .../netdev.yaml --list-ops
  dev-get  [ do, dump ]
  page-pool-get  [ do, dump ]
  page-pool-stats-get  [ do, dump ]
  queue-get  [ do, dump ]
  napi-get  [ do, dump ]
  qstats-get  [ dump ]

For completeness also support listing all ops (including
notifications):

  # ./cli.py --spec .../netdev.yaml --list-msgs
  dev-get  [ dump, do ]
  dev-add-ntf  [ notify ]
  dev-del-ntf  [ notify ]
  dev-change-ntf  [ notify ]
  page-pool-get  [ dump, do ]
  page-pool-add-ntf  [ notify ]
  page-pool-del-ntf  [ notify ]
  page-pool-change-ntf  [ notify ]
  page-pool-stats-get  [ dump, do ]
  queue-get  [ dump, do ]
  napi-get  [ dump, do ]
  qstats-get  [ dump ]

Use double space after the name for slightly easier to read
output.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240502164043.2130184-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:13:21 -07:00
Jakub Kicinski
f3ad491433 Merge branch 'rtnetlink-rtnl_stats_dump-changes'
Eric Dumazet says:

====================
rtnetlink: rtnl_stats_dump() changes

Getting rid of RTNL in rtnl_stats_dump() looks challenging.

In the meantime, we can:

1) Avoid RTNL acquisition for the final NLMSG_DONE marker.

2) Use for_each_netdev_dump() instead of the net->dev_index_head[]
   hash table.
====================

Link: https://lore.kernel.org/r/20240502113748.1622637-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:04:12 -07:00
Eric Dumazet
0feb396f74 rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()
Switch rtnl_stats_dump() to use for_each_netdev_dump()
instead of net->dev_index_head[] hash table.

This makes the code much easier to read, and fixes
scalability issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
Eric Dumazet
136c2a9a2a rtnetlink: change rtnl_stats_dump() return value
By returning 0 (or an error) instead of skb->len,
we allow NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.
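
A simplified model of the effect, assuming fixed-size entries and an
end-of-dump marker that takes one entry-sized slot (count_recvmsg()
and its parameters are hypothetical and only meant to show where the
extra read comes from):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the recvmsg() calls needed to read a dump of 'items' entries
 * when each buffer holds 'per_buf' entries.
 *
 * legacy == true:  the handler reports "more to come" whenever it wrote
 *                  anything, so it is called once more on an empty
 *                  buffer before the end marker can be emitted.
 * legacy == false: the handler returns 0 as soon as nothing is left, so
 *                  the end marker is appended to the last buffer when
 *                  there is room for it.
 */
static int count_recvmsg(int items, int per_buf, bool legacy)
{
	int calls = 0;

	assert(per_buf > 0);

	while (items > 0) {
		int n = items < per_buf ? items : per_buf;

		items -= n;
		calls++;
		if (items == 0 && !legacy && n < per_buf)
			return calls;	/* end marker fit in the same buffer */
	}
	return calls + 1;	/* end marker travels in its own buffer */
}
```

In this model a dump that fits in one buffer needs two reads with the
old return value and one with the new one.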

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
David S. Miller
5829614a7b Merge branch 'net-sysctl-sentinel'
Joel Granados says:

====================
sysctl: Remove sentinel elements from networking

What?
These commits remove the sentinel element (last empty element) from the
sysctl arrays of all the files under the "net/" directory that register
a sysctl array. The merging of the preparation patches [4] to mainline
allows us to just remove sentinel elements without changing behavior.
This is safe because the sysctl registration code (register_sysctl() and
friends) uses the array size in addition to checking for a sentinel [1].
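
The difference between the two termination styles can be sketched in
plain C. This is a minimal illustration, not the kernel code: struct
entry is a hypothetical, cut-down stand-in for struct ctl_table, and
both counting helpers are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, cut-down table entry; the real structure has more fields. */
struct entry {
	const char *procname;
	int *data;
};

/* Old style: walk until the empty sentinel (procname == NULL). */
static size_t count_until_sentinel(const struct entry *table)
{
	size_t n = 0;

	assert(table != NULL);
	while (table[n].procname)
		n++;
	return n;
}

/* New style: the caller passes the size, so no sentinel is required. */
static size_t count_with_size(const struct entry *table, size_t table_size)
{
	size_t n = 0;

	assert(table != NULL);
	for (size_t i = 0; i < table_size; i++)
		if (table[i].procname)
			n++;
	return n;
}

static int val_a, val_b;

static const struct entry with_sentinel[] = {
	{ "a", &val_a },
	{ "b", &val_b },
	{ NULL, NULL },		/* sentinel: one extra element per array */
};

static const struct entry without_sentinel[] = {
	{ "a", &val_a },
	{ "b", &val_b },
};
```

Both helpers report two usable entries; the array without the sentinel
is exactly one element smaller.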

Why?
By removing the sysctl sentinel elements we avoid kernel bloat as
ctl_table arrays get moved out of kernel/sysctl.c into their own
respective subsystems. This move was started long ago to avoid merge
conflicts; the sentinel removal bit came after Matthew Wilcox suggested
it to avoid bloating the kernel by one element as arrays moved out. This
patchset will reduce the overall build-time size of the kernel and
run-time memory bloat by about 64 bytes per declared ctl_table array
(more info here [5]).

When are we done?
There are 4 patchsets (25 commits [2]) that are still outstanding to
completely remove the sentinels: files under "net/" (this patchset),
files under "kernel/" dir, misc dirs (files under mm/ security/ and
others) and the final set that removes the unneeded check for ->procname
== NULL.

Testing:
* Ran sysctl selftests (./tools/testing/selftests/sysctl/sysctl.sh)
* Ran this through 0-day with no errors or warnings

Savings in vmlinux:
  A total of 64 bytes per sentinel is saved after removal; I measured in
  x86_64 to give an idea of the aggregated savings. The actual savings
  will depend on individual kernel configuration.
    * bloat-o-meter
        - The "yesall" config saves 3976 bytes (bloat-o-meter output [6])
        - A reduced config [3] saves 1263 bytes (bloat-o-meter output [7])

Savings in allocated memory:
  None in this set but will occur when the superfluous allocations are
  removed from proc_sysctl.c. I include it here for context. The
  estimated savings during boot for config [3] are 6272 bytes. See [8]
  for how to measure it.

Comments/feedback greatly appreciated

Changes in v6:
- Rebased onto net-next/main.
- Besides re-running my cocci scripts, I ran a new find script [9].
  Found 0 hits in net/
- Moved "i" variable declaration out of for() in sysctl_core_net_init
- Removed forgotten sentinel in mpls_table
- Removed CONFIG_AX25_DAMA_SLAVE guard from net/ax25/ax25_ds_timer.c. It
  is not needed because that file is compiled only when
  CONFIG_AX25_DAMA_SLAVE is set.
- When traversing smc_table, stop on ARRAY_SIZE instead of ARRAY_SIZE-1.
- Link to v5: https://lore.kernel.org/r/20240426-jag-sysctl_remset_net-v5-0-e3b12f6111a6@samsung.com

Changes in v5:
- Added net files with additional variable to my test .config so the
  typo can be caught next time.
- Fixed typo tabel_size -> table_size
- Link to v4: https://lore.kernel.org/r/20240425-jag-sysctl_remset_net-v4-0-9e82f985777d@samsung.com

Changes in v4:
- Keep reverse xmas tree order when introducing new variables
- Use a table_size variable to keep the value of ARRAY_SIZE
- Separated the original "networking: Remove the now superfluous
  sentinel elements from ctl_table arra" into smaller commits to ease
  review
- Merged x.25 and ax.25 commits together.
- Removed any SOB from the commits that were changed
- Link to v3: https://lore.kernel.org/r/20240412-jag-sysctl_remset_net-v3-0-11187d13c211@samsung.com

Changes in v3:
- Reworked ax.25
  - Added a BUILD_BUG_ON for the ax.25 commit
  - Added a CONFIG_AX25_DAMA_SLAVE guard where needed
- Link to v2: https://lore.kernel.org/r/20240328-jag-sysctl_remset_net-v2-0-52c9fad9a1af@samsung.com

Changes in v2:
- Rebased to v6.9-rc1
- Removed unneeded comment from sysctl_net_ax25.c
- Link to v1: https://lore.kernel.org/r/20240314-jag-sysctl_remset_net-v1-0-aa26b44d29d9@samsung.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:43 +01:00
Joel Granados
78a7b5dbc0 ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Avoid a buffer overflow when traversing the ctl_table by ensuring that
AX25_MAX_VALUES is the same as the size of ax25_param_table. This is
done with a BUILD_BUG_ON where ax25_param_table is defined and a
CONFIG_AX25_DAMA_SLAVE guard in the unnamed enum definition as well as
in the ax25_dev_device_up and ax25_ds_set_timer functions.

The overflow happened when the sentinel was removed from
ax25_param_table. The sentinel's data element was changed when
CONFIG_AX25_DAMA_SLAVE was undefined. This had no adverse effects as it
still stopped on the sentinel's null procname but needed to be addressed
once the sentinel was removed.
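
The guard described above can be sketched in standalone C11, with
static_assert standing in for the kernel's BUILD_BUG_ON. All names
here (PARAM_*, HAVE_OPTIONAL_PARAM, param_table, sum_defaults) are
hypothetical and do not reflect the actual ax25 definitions:

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical parameter indices; the last enumerator is the count. */
enum {
	PARAM_WINDOW,
	PARAM_TIMEOUT,
#ifdef HAVE_OPTIONAL_PARAM	/* hypothetical config switch */
	PARAM_OPTIONAL,
#endif
	PARAM_MAX
};

struct param {
	const char *name;
	int def;
};

/* The same guard is applied to the table, so both stay in step. */
static const struct param param_table[] = {
	[PARAM_WINDOW]   = { "window", 2 },
	[PARAM_TIMEOUT]  = { "timeout", 10 },
#ifdef HAVE_OPTIONAL_PARAM
	[PARAM_OPTIONAL] = { "optional", 0 },
#endif
};

/* Fails the build if the enum and the table ever disagree. */
static_assert(ARRAY_SIZE(param_table) == PARAM_MAX,
	      "param_table must have exactly PARAM_MAX entries");

/* Looping up to PARAM_MAX is safe: the assertion guarantees the bound. */
static int sum_defaults(void)
{
	int sum = 0;

	for (int i = 0; i < PARAM_MAX; i++)
		sum += param_table[i].def;
	return sum;
}
```

Without a sentinel there is no spare element to absorb an off-by-one,
so the size check has to happen at build time.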

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:43 +01:00
Joel Granados
e00e35e217 appletalk: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel from atalk_table ctl_table array.

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados
635470eb0a netfilter: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel elements from ctl_table structs
* Remove instances where an array element is zeroed out to make it look
  like a sentinel. This is no longer needed and is safe after commit
  c899710fe7 ("networking: Update to register_net_sysctl_sz") added
  the array size to the ctl_table registration
* Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the
  sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL
* Remove extra element in ctl_table arrays declarations

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados
73dbd8cf79 net: Remove ctl_table sentinel elements from several networking subsystems
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

To avoid lots of small commits, this commit brings together network
changes from (as they appear in MAINTAINERS) LLC, MPTCP, NETROM NETWORK
LAYER, PHONET PROTOCOL, ROSE NETWORK LAYER, RXRPC SOCKETS, SCTP
PROTOCOL, SHARED MEMORY COMMUNICATIONS (SMC), TIPC NETWORK LAYER and
NETWORKING [IPSEC]

* Remove sentinel element from ctl_table structs.
* Replace empty array registration with the register_net_sysctl_sz call
  in llc_sysctl_init
* Replace the for loop stop condition that tests for procname == NULL
  with one that depends on array size in sctp_sysctl_net_register
* Remove instances where an array element is zeroed out to make it look
  like a sentinel in xfrm_sysctl_init. This is no longer needed and is
  safe after commit c899710fe7 ("networking: Update to
  register_net_sysctl_sz") added the array size to the ctl_table
  registration
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados
ca5d1fce79 net: sunrpc: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados
92bedf0783 net: rds: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Acked-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00
Joel Granados
1c106eb01c net: ipv{6,4}: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in sysctl_route_net_init and ipv6_route_sysctl_init.
  This is no longer needed and is safe after commit c899710fe7
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Remove extra sentinel element in the declaration of devinet_vars.
* Removed the "-1" in __devinet_sysctl_register, sysctl_route_net_init,
  ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having
  an extra empty element when looping over ctl_table arrays
* Replace the for loop stop condition in __addrconf_sysctl_register that
  tests for procname == NULL with one that depends on array size
* Removing the unprivileged user check in ipv6_route_sysctl_init is
  safe as it is replaced by calling ipv6_route_sysctl_table_size;
  introduced in commit c899710fe7 ("networking: Update to
  register_net_sysctl_sz")
* Use a table_size variable to keep the value of ARRAY_SIZE
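
A minimal userspace sketch of the loop-bound change when a table is
duplicated per namespace and its data pointers are rebased. The types
and names (struct entry, struct ns_cfg, template_table, clone_table)
are hypothetical, and malloc()/memcpy() stand in for the kernel's
allocation helpers:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, cut-down table entry and per-namespace settings. */
struct entry {
	const char *procname;
	int *data;
};

struct ns_cfg {
	int forwarding;
	int mtu;
};

static struct ns_cfg init_cfg = { 1, 1500 };

/* Template pointing at the initial namespace; no trailing sentinel. */
static const struct entry template_table[] = {
	{ "forwarding", &init_cfg.forwarding },
	{ "mtu", &init_cfg.mtu },
};

/* Duplicate the template and point each entry at 'cfg' instead. */
static struct entry *clone_table(struct ns_cfg *cfg)
{
	size_t table_size = ARRAY_SIZE(template_table);
	struct entry *t;

	assert(cfg != NULL);

	t = malloc(sizeof(template_table));
	if (!t)
		return NULL;
	memcpy(t, template_table, sizeof(template_table));

	/* Every element is real, so run to table_size, not table_size - 1. */
	for (size_t i = 0; i < table_size; i++) {
		size_t off = (size_t)((char *)t[i].data - (char *)&init_cfg);

		t[i].data = (int *)((char *)cfg + off);
	}
	return t;
}
```

With a sentinel present, the same loop had to stop one element early
to avoid rebasing the empty entry's NULL data pointer.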

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:42 +01:00