linux/net/ipv4
Ido Schimmel 7e1dc94b2d nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID
commit 8743aeff5b upstream.

A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.

The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id 10 group 1 type resilient buckets 2
 # strace -e sendto,recvmsg -s 5 ip nexthop bucket
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
 id 10 index 0 idle_time 6.66 nhid 1
 id 10 index 1 idle_time 6.66 nhid 1
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
 +++ exited with 0 +++

This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
 # ip nexthop bucket
 id 4294967295 index 0 idle_time 5.55 nhid 1
 id 4294967295 index 1 idle_time 5.55 nhid 1
 id 4294967295 index 0 idle_time 5.55 nhid 1
 id 4294967295 index 1 idle_time 5.55 nhid 1
 [...]

Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:

 # ip link add name dummy1 up type dummy
 # ip nexthop add id 1 dev dummy1
 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
 # strace -e sendto,recvmsg -s 5 ip nexthop bucket
 sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
 id 4294967295 index 0 idle_time 6.61 nhid 1
 id 4294967295 index 1 idle_time 6.61 nhid 1
 +++ exited with 0 +++

Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.

Add a test that fails before the fix:

 # ./fib_nexthops.sh -t basic_res
 [...]
 TEST: Maximum nexthop ID dump                                       [FAIL]
 [...]

And passes after it:

 # ./fib_nexthops.sh -t basic_res
 [...]
 TEST: Maximum nexthop ID dump                                       [ OK ]
 [...]

Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16 18:22:02 +02:00
..
bpfilter net: Revert "net: optimize the sockptr_t for unified kernel/user address spaces" 2020-08-10 12:06:44 -07:00
netfilter netfilter: tproxy: fix deadlock due to missing BH disable 2023-03-17 08:48:55 +01:00
af_inet.c tcp: deny tcp_disconnect() when threads are waiting 2023-06-09 10:32:17 +02:00
ah4.c Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
arp.c ipv4: Invalidate neighbour for broadcast address upon address addition 2022-04-13 20:59:05 +02:00
bpf_tcp_ca.c bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs 2021-11-25 09:49:07 +01:00
cipso_ipv4.c cipso: Fix data-races around sysctl. 2022-07-21 21:24:21 +02:00
datagram.c udp: Update reuse->has_conns under reuseport_lock. 2022-10-29 10:12:56 +02:00
devinet.c net: Fix data-races around sysctl_devconf_inherit_init_net. 2022-08-31 17:16:44 +02:00
esp4_offload.c xfrm: Linearize the skb after offloading if needed. 2023-06-28 10:29:46 +02:00
esp4.c net: ipv4: Use kfree_sensitive instead of kfree 2023-07-27 08:47:01 +02:00
fib_frontend.c ipv4: Fix incorrect table ID in IOCTL path 2023-03-22 13:31:28 +01:00
fib_lookup.h ipv4: fix data races in fib_alias_hw_flags_set 2022-02-23 12:03:10 +01:00
fib_notifier.c
fib_rules.c ipv4: convert fib_num_tclassid_users to atomic_t 2021-12-08 09:04:49 +01:00
fib_semantics.c ipv4: prevent potential spectre v1 gadget in fib_metrics_match() 2023-02-01 08:27:27 +01:00
fib_trie.c ipv4: Fix error return code in fib_table_insert() 2022-12-02 17:41:07 +01:00
fou.c fou: remove sparse errors 2021-08-31 12:03:33 +01:00
gre_demux.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
gre_offload.c ip_gre: add csum offload support for gre header 2021-01-29 20:39:14 -08:00
icmp.c icmp: guard against too small mtu 2023-04-13 16:48:18 +02:00
igmp.c igmp: Fix data-races around sysctl_igmp_qrv. 2022-08-03 12:03:48 +02:00
inet_connection_sock.c tcp: annotate data-races around icsk->icsk_syn_retries 2023-07-27 08:47:03 +02:00
inet_diag.c inet_diag: fix kernel-infoleak for UDP sockets 2021-12-22 09:32:40 +01:00
inet_fragment.c inet: frags: annotate races around fqdir->dead and fqdir->high_thresh 2022-01-27 11:05:35 +01:00
inet_hashtables.c Revert "tcp: avoid the lookup process failing to get sk in ehash table" 2023-07-27 08:47:02 +02:00
inet_timewait_sock.c Revert "tcp: avoid the lookup process failing to get sk in ehash table" 2023-07-27 08:47:02 +02:00
inetpeer.c inetpeer: Fix data-races around sysctl. 2022-07-21 21:24:21 +02:00
ip_forward.c ip: Fix data-races around sysctl_ip_fwd_update_priority. 2022-07-29 17:25:13 +02:00
ip_fragment.c inet: frags: annotate races around fqdir->dead and fqdir->high_thresh 2022-01-27 11:05:35 +01:00
ip_gre.c erspan: do not use skb_mac_header() in ndo_start_xmit() 2023-03-30 12:47:48 +02:00
ip_input.c xfrm: fix "disable_policy" on ipv4 early demux 2022-12-02 17:41:02 +01:00
ip_options.c net: clean up codestyle for net/ipv4 2020-08-25 06:28:02 -07:00
ip_output.c ipv4: Fix potential uninit variable access bug in __ip_make_skb() 2023-05-11 23:00:31 +09:00
ip_sockglue.c ipv{4,6}/raw: fix output xfrm lookup wrt protocol 2023-06-05 09:21:26 +02:00
ip_tunnel_core.c tunnels: fix kasan splat when generating ipv4 pmtu error 2023-08-16 18:22:01 +02:00
ip_tunnel.c net: tunnels: annotate lockless accesses to dev->needed_headroom 2023-03-22 13:31:26 +01:00
ip_vti.c ip_tunnel: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
ipcomp.c Networking changes for 5.14. 2021-06-30 15:51:09 -07:00
ipconfig.c net: ipconfig: Don't override command-line hostnames or domains 2021-06-02 13:27:03 -07:00
ipip.c ip_tunnel: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
ipmr_base.c
ipmr.c ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure path 2022-02-16 12:56:29 +01:00
Kconfig tcp: configurable source port perturb table size 2022-12-02 17:41:11 +01:00
Makefile bpf: Clean up sockmap related Kconfigs 2021-02-26 12:28:03 -08:00
metrics.c ipv4: prevent potential spectre v1 gadget in ip_metrics_convert() 2023-02-01 08:27:27 +01:00
netfilter.c netfilter: Dissect flow after packet mangling 2021-04-18 22:04:16 +02:00
netlink.c
nexthop.c nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID 2023-08-16 18:22:02 +02:00
ping.c ping: fix address binding wrt vrf 2022-05-18 10:26:57 +02:00
proc.c ip: Fix data-races around sysctl_ip_default_ttl. 2022-07-29 17:25:09 +02:00
protocol.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
raw_diag.c net: Use nlmsg_unicast() instead of netlink_unicast() 2021-07-13 09:28:29 -07:00
raw.c ipv{4,6}/raw: fix output xfrm lookup wrt protocol 2023-06-05 09:21:26 +02:00
route.c ipv4: Fix data-races around sysctl_fib_multipath_hash_fields. 2022-07-29 17:25:21 +02:00
syncookies.c mptcp: remove MPTCP 'ifdef' in TCP SYN cookies 2023-01-12 11:58:52 +01:00
sysctl_net_ipv4.c tcp: restrict net.ipv4.tcp_app_win 2023-04-20 12:13:53 +02:00
tcp_bbr.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_bic.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_bpf.c net: deal with most data-races in sk_wait_event() 2023-05-24 17:36:42 +01:00
tcp_cdg.c tcp: cdg: allow tcp_cdg_release() to be called multiple times 2022-11-26 09:24:50 +01:00
tcp_cong.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_cubic.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_dctcp.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_dctcp.h
tcp_diag.c
tcp_fastopen.c tcp: annotate data-races around fastopenq.max_qlen 2023-07-27 08:47:04 +02:00
tcp_highspeed.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_htcp.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_hybla.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_illinois.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_input.c tcp: annotate data races in __tcp_oow_rate_limited() 2023-07-23 13:47:29 +02:00
tcp_ipv4.c tcp: annotate data-races around tcp_rsk(req)->ts_recent 2023-07-27 08:47:01 +02:00
tcp_lp.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_metrics.c tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen 2023-08-11 15:13:55 +02:00
tcp_minisocks.c tcp: annotate data-races around tcp_rsk(req)->ts_recent 2023-07-27 08:47:01 +02:00
tcp_nv.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_offload.c net, gro: Set inner transport header offset in tcp/udp GRO hook 2021-08-02 10:20:56 +01:00
tcp_output.c tcp: annotate data-races around tcp_rsk(req)->ts_recent 2023-07-27 08:47:01 +02:00
tcp_rate.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_recovery.c tcp: Fix data-races around sysctl_tcp_recovery. 2022-07-29 17:25:22 +02:00
tcp_scalable.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_timer.c tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts. 2022-07-29 17:25:22 +02:00
tcp_ulp.c net/ulp: use consistent error code when blocking ULP 2023-01-24 07:22:48 +01:00
tcp_vegas.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_vegas.h
tcp_veno.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_westwood.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp_yeah.c tcp: add accessors to read/set tp->snd_cwnd 2022-06-14 18:36:11 +02:00
tcp.c tcp: annotate data-races around fastopenq.max_qlen 2023-07-27 08:47:04 +02:00
tunnel4.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
udp_bpf.c bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser() 2023-03-17 08:48:54 +01:00
udp_diag.c net: Use nlmsg_unicast() instead of netlink_unicast() 2021-07-13 09:28:29 -07:00
udp_impl.h net: pass a sockptr_t into ->setsockopt 2020-07-24 15:41:54 -07:00
udp_offload.c fou: remove sparse errors 2021-08-31 12:03:33 +01:00
udp_tunnel_core.c net/tunnel: wait until all sk_user_data reader finish before releasing the sock 2022-12-31 13:14:19 +01:00
udp_tunnel_nic.c udp_tunnel: Fix end of loop test in udp_tunnel_nic_unregister() 2022-03-02 11:47:59 +01:00
udp_tunnel_stub.c udp_tunnel: add central NIC RX port offload infrastructure 2020-07-10 13:54:00 -07:00
udp.c tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct(). 2023-04-26 13:51:54 +02:00
udplite.c udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated(). 2023-05-30 13:55:31 +01:00
xfrm4_input.c xfrm: fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets 2023-06-28 10:29:46 +02:00
xfrm4_output.c
xfrm4_policy.c
xfrm4_protocol.c net: xfrm: unexport __init-annotated xfrm4_protocol_init() 2022-06-14 18:36:18 +02:00
xfrm4_state.c
xfrm4_tunnel.c xfrm: remove description from xfrm_type struct 2021-06-09 09:38:52 +02:00