The 'olr' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'olr' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307163020.2524409-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only error that can happen during a nexthop dump is insufficient
space in the skb caring the netlink messages (EMSGSIZE). If this happens
and some messages were already filled in, the nexthop code returns the
skb length to signal the netlink core that more objects need to be
dumped.
After commit b5a899154a ("netlink: handle EMSGSIZE errors in the
core") there is no need to handle this error in the nexthop code as it
is now handled in the core.
Simplify the code and simply return the error to the core.
No regressions in nexthop tests:
# ./fib_nexthops.sh
Tests passed: 234
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240307154727.3555462-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add netlink support for reading NH group hardware stats.
Stats collection is done through a new notifier,
NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
a given NH group are thereby asked to collect the stats and report back to
core by calling nh_grp_hw_stats_report_delta(). This is similar to what
netdevice L3 stats do.
Besides exposing number of packets that passed in the HW datapath, also
include information on whether any driver actually realizes the counters.
The core can tell based on whether it got any _report_delta() reports from
the drivers. This allows enabling the statistics at the group at any time,
with drivers opting into supporting them. This is also in line with what
netdevice L3 stats are doing.
So as not to waste time and space, tie the collection and reporting of HW
stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kees Cook <keescook@chromium.org> # For the __counted_by bits
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink support for enabling collection of HW statistics on nexthop
groups.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add hw_stats field to several notifier structures to communicate to the
drivers that HW statistics should be configured for nexthops within a given
group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink support for reading NH group stats.
This data is only for statistics of the traffic in the SW datapath. HW
nexthop group statistics will be added in the following patches.
Emission of the stats is keyed to a new op_stats flag to avoid cluttering
the netlink message with stats if the user doesn't need them:
NHA_OP_FLAG_DUMP_STATS.
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nexthop group entry stats to count the number of packets forwarded
via each nexthop in the group. The stats will be exposed to user space
for better data path observability in the next patch.
The per-CPU stats pointer is placed at the beginning of 'struct
nh_grp_entry', so that all the fields accessed for the data path reside
on the same cache line:
struct nh_grp_entry {
struct nexthop * nh; /* 0 8 */
struct nh_grp_entry_stats * stats; /* 8 8 */
u8 weight; /* 16 1 */
/* XXX 7 bytes hole, try to pack */
union {
struct {
atomic_t upper_bound; /* 24 4 */
} hthr; /* 24 4 */
struct {
struct list_head uw_nh_entry; /* 24 16 */
u16 count_buckets; /* 40 2 */
u16 wants_buckets; /* 42 2 */
} res; /* 24 24 */
}; /* 24 24 */
struct list_head nh_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct nexthop * nh_parent; /* 64 8 */
/* size: 72, cachelines: 2, members: 6 */
/* sum members: 65, holes: 1, sum holes: 7 */
/* last cacheline: 8 bytes */
};
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to add per-nexthop statistics, but still not increase netlink
message size for consumers that do not care about them, there needs to be a
toggle through which the user indicates their desire to get the statistics.
To that end, add a new attribute, NHA_OP_FLAGS. The idea is to be able to
use the attribute for carrying of arbitrary operation-specific flags, i.e.
not make it specific for get / dump.
Add the new attribute to get and dump policies, but do not actually allow
any flags yet -- those will come later as the flags themselves are defined.
Add the necessary parsing code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A following patch will introduce a new attribute, op-specific flags to
adjust the behavior of an operation. Different operations will recognize
different flags.
- To make the differentiation possible, stop sharing the policies for get
and del operations.
- To allow querying for presence of the attribute, have all the attribute
arrays sized to NHA_MAX, regardless of what is permitted by policy, and
pass the corresponding value to nlmsg_parse() as well.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"struct net_protocol" has a 32bit hole in 32bit arches.
Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are read in rx path, move them to net_hotdata
for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-14-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are used in GRO and GSO paths.
Move them to net_hodata for better cache locality.
v2: udpv6_offload definition depends on CONFIG_INET=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These are used in TCP fast paths.
Move them into net_hotdata for better cache locality.
v2: tcpv6_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are used in GRO and GSO paths.
v2: ipv6_packet_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit b5a899154a ("netlink: handle EMSGSIZE errors
in the core"), we can remove some code that was not 100 % correct
anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306102426.245689-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Found with git grep 'MODULE_AUTHOR(".*([^)]*@'
Fixed with
sed -i '/MODULE_AUTHOR(".*([^)]*@/{s/ (/ </g;s/)"/>"/;s/)and/> and/}' \
$(git grep -l 'MODULE_AUTHOR(".*([^)]*@')
Also:
in drivers/media/usb/siano/smsusb.c normalise ", INC" to ", Inc";
this is what every other MODULE_AUTHOR for this company says,
and it's what the header says
in drivers/sbus/char/openprom.c normalise a double-spaced separator;
this is clearly copied from the copyright header,
where the names are aligned on consecutive lines thusly:
* Linux/SPARC PROM Configuration Driver
* Copyright (C) 1996 Thomas K. Dyas (tdyas@noc.rutgers.edu)
* Copyright (C) 1996 Eddie C. Dost (ecd@skynet.be)
but the authorship branding is single-line
Link: https://lkml.kernel.org/r/mk3geln4azm5binjjlfsgjepow4o73domjv6ajybws3tz22vb3@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently getsockopt does not support IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT, and we are unable to get the values of these two
socket options through getsockopt.
This patch adds getsockopt support for IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing for places where zero-sized destinations were still showing
up in the kernel, sock_copy() and inet_reqsk_clone() were found, which
are using very specific memcpy() offsets for both avoiding a portion of
struct sock, and copying beyond the end of it (since struct sock is really
just a common header before the protocol-specific allocation). Instead
of trying to unravel this historical lack of container_of(), just switch
to unsafe_memcpy(), since that's effectively what was happening already
(memcpy() wasn't checking 0-sized destinations while the code base was
being converted away from fake flexible arrays).
Avoid the following false positive warning with future changes to
CONFIG_FORTIFY_SOURCE:
memcpy: detected field-spanning write (size 3068) of destination "&nsk->__sk_common.skc_dontcopy_end" at net/core/sock.c:2057 (size 0)
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240304212928.make.772-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bridge driver today has no support to forward the userspace timestamp
packets and ends up resetting the timestamp. ETF qdisc checks the
packet coming from userspace and encounters to be 0 thereby dropping
time sensitive packets. These changes will allow userspace timestamps
packets to be forwarded from the bridge to NIC drivers.
Setting the same bit (mono_delivery_time) to avoid dropping of
userspace tstamp packets in the forwarding path.
Existing functionality of mono_delivery_time remains unaltered here,
instead just extended with userspace tstamp support for bridge
forwarding path.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240301201348.2815102-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In tcp_gro_complete() :
Moving the skb->inner_transport_header setting
allows the compiler to reuse the previously loaded value
of skb->transport_header.
Caching skb_shinfo() avoids duplications as well.
In tcp4_gro_complete(), doing a single change on
skb_shinfo(skb)->gso_type also generates better code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
skb_gro_header_hard() is renamed to skb_gro_may_pull() to match
the convention used by common helpers like pskb_may_pull().
This means the condition is inverted:
if (skb_gro_header_hard(skb, hlen))
slow_path();
becomes:
if (!skb_gro_may_pull(skb, hlen))
slow_path();
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stephen Rothwell and kernel test robot reported that some arches
(parisc, hexagon) and/or compilers would not like blamed commit.
Lets make sure tcp_sock_write_rx group does not start with a hole.
While we are at it, correct tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE()
since after the blamed commit, we went to 105 bytes.
Fixes: 9912362205 ("tcp: remove some holes in struct tcp_sock")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/netdev/20240301121108.5d39e4f9@canb.auug.org.au/
Closes: https://lore.kernel.org/oe-kbuild-all/202403011451.csPYOS3C-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://lore.kernel.org/r/20240301171945.2958176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Perform all validations when updating values of struct_ops maps. Doing
validation in st_ops->reg() and st_ops->update() is not necessary anymore.
However, tcp_register_congestion_control() has been called in various
places. It still needs to do validations.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This is a cleanup patch, making code a bit more concise.
1) Use skb_network_offset(skb) in place of
(skb_network_header(skb) - skb->data)
2) Use -skb_network_offset(skb) in place of
(skb->data - skb_network_header(skb))
3) Use skb_transport_offset(skb) in place of
(skb_transport_header(skb) - skb->data)
4) Use skb_inner_transport_offset(skb) in place of
(skb_inner_transport_header(skb) - skb->data)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZeEKVAAKCRDbK58LschI
g7oYAQD5Jlv4fIVTvxvfZrTTZ2tU+OsPa75mc8SDKwpash3YygEA8kvESy8+t6pg
D6QmSf1DIZdFoSp/bV+pfkNWMeR8gwg=
=mTAj
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-02-29
We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).
The main changes are:
1) Extend the BPF verifier to enable static subprog calls in spin lock
critical sections, from Kumar Kartikeya Dwivedi.
2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
in BPF global subprogs, from Andrii Nakryiko.
3) Larger batch of riscv BPF JIT improvements and enabling inlining
of the bpf_kptr_xchg() for RV64, from Pu Lehui.
4) Allow skeleton users to change the values of the fields in struct_ops
maps at runtime, from Kui-Feng Lee.
5) Extend the verifier's capabilities of tracking scalars when they
are spilled to stack, especially when the spill or fill is narrowing,
from Maxim Mikityanskiy & Eduard Zingerman.
6) Various BPF selftest improvements to fix errors under gcc BPF backend,
from Jose E. Marchesi.
7) Avoid module loading failure when the module trying to register
a struct_ops has its BTF section stripped, from Geliang Tang.
8) Annotate all kfuncs in .BTF_ids section which eventually allows
for automatic kfunc prototype generation from bpftool, from Daniel Xu.
9) Several updates to the instruction-set.rst IETF standardization
document, from Dave Thaler.
10) Shrink the size of struct bpf_map resp. bpf_array,
from Alexei Starovoitov.
11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
from Benjamin Tissoires.
12) Fix bpftool to be more portable to musl libc by using POSIX's
basename(), from Arnaldo Carvalho de Melo.
13) Add libbpf support to gcc in CORE macro definitions,
from Cupertino Miranda.
14) Remove a duplicate type check in perf_event_bpf_event,
from Florian Lehner.
15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
with notrace correctly, from Yonghong Song.
16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
array to fix build warnings, from Kees Cook.
17) Fix resolve_btfids cross-compilation to non host-native endianness,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
selftests/bpf: Test if shadow types work correctly.
bpftool: Add an example for struct_ops map and shadow type.
bpftool: Generated shadow variables for struct_ops maps.
libbpf: Convert st_ops->data to shadow type.
libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
bpf, arm64: use bpf_prog_pack for memory management
arm64: patching: implement text_poke API
bpf, arm64: support exceptions
arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
bpf: add is_async_callback_calling_insn() helper
bpf: introduce in_sleepable() helper
bpf: allow more maps in sleepable bpf programs
selftests/bpf: Test case for lacking CFI stub functions.
bpf: Check cfi_stubs before registering a struct_ops type.
bpf: Clarify batch lookup/lookup_and_delete semantics
bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
bpf, docs: Fix typos in instruction-set.rst
selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
bpf: Shrink size of struct bpf_map/bpf_array.
...
====================
Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) inet_dump_ifaddr() can can run under RCU protection
instead of RTNL.
2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the following patch, inet_base_seq() will no longer be called
with RTNL held.
Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_flags can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_preferred_lft can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_valid_lft can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_tstamp can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Do the same for ifa->ifa_cstamp to prepare upcoming changes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The input parameter 'level' in do_raw_set/getsockopt() is not used.
Therefore, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228072505.640550-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since commit 446fda4f26 ("[NetLabel]: CIPSOv4 engine"), *bitmap_walk
function only returns -1. Nearly 18 years have passed, -2 scenes never
come up, so there's no need to consider it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20240227093604.3574241-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) inet_netconf_dump_devconf() can run under RCU protection
instead of RTNL.
2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.
3) Do not use inet_base_seq() anymore, for_each_netdev_dump()
has nice properties. Restarting a GETDEVCONF dump if a device has
been added/removed or if net->ipv4.dev_addr_genid has changed is moot.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"ip -4 netconf show dev XXXX" no longer acquires RTNL.
Return -ENODEV instead of -EINVAL if no netdev or idev can be found.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add READ_ONCE() in ipv4_devconf_get() and corresponding
WRITE_ONCE() in ipv4_devconf_set()
Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros,
and use them when reading devconf fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's time to let it work right now. We've already prepared for this:)
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update three callers including both ipv4 and ipv6 and let the dropreason
mechanism work in reality.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, I equipped this function with more dropreasons, but
it still doesn't work yet, which I will do later.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does two things:
1) add two more new reasons
2) only change the return value(1) to various drop reason values
for the future use
For now, we still cannot trace those two reasons. We'll implement the full
function in the subsequent patch in this series.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now it's time to use the prepared definitions to refine this part.
Four reasons used might enough for now, I think.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only move the skb drop from tcp_v4_do_rcv() to cookie_v4_check() itself,
no other changes made. It can help us refine the specific drop reasons
later.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer hold RTNL while calling inet_dump_fib().
Also change return value for a completed dump:
Returning 0 instead of skb->len allows NLMSG_DONE
to be appended to the skb. User space does not have
to call us again to get a standalone NLMSG_DONE marker.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new field into struct fib_dump_filter, to let callers
tell if they use RTNL locking or RCU.
This is used in the following patch, when inet_dump_fib()
no longer holds RTNL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller triggered following kasan splat:
BUG: KASAN: use-after-free in __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170
Read of size 1 at addr ffff88812fb4000e by task syz-executor183/5191
[..]
kasan_report+0xda/0x110 mm/kasan/report.c:588
__skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170
skb_flow_dissect_flow_keys include/linux/skbuff.h:1514 [inline]
___skb_get_hash net/core/flow_dissector.c:1791 [inline]
__skb_get_hash+0xc7/0x540 net/core/flow_dissector.c:1856
skb_get_hash include/linux/skbuff.h:1556 [inline]
ip_tunnel_xmit+0x1855/0x33c0 net/ipv4/ip_tunnel.c:748
ipip_tunnel_xmit+0x3cc/0x4e0 net/ipv4/ipip.c:308
__netdev_start_xmit include/linux/netdevice.h:4940 [inline]
netdev_start_xmit include/linux/netdevice.h:4954 [inline]
xmit_one net/core/dev.c:3548 [inline]
dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564
__dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4349
dev_queue_xmit include/linux/netdevice.h:3134 [inline]
neigh_connected_output+0x42c/0x5d0 net/core/neighbour.c:1592
...
ip_finish_output2+0x833/0x2550 net/ipv4/ip_output.c:235
ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
..
iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1dbc/0x33c0 net/ipv4/ip_tunnel.c:831
ipgre_xmit+0x4a1/0x980 net/ipv4/ip_gre.c:665
__netdev_start_xmit include/linux/netdevice.h:4940 [inline]
netdev_start_xmit include/linux/netdevice.h:4954 [inline]
xmit_one net/core/dev.c:3548 [inline]
dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564
...
The splat occurs because skb->data points past skb->head allocated area.
This is because neigh layer does:
__skb_pull(skb, skb_network_offset(skb));
... but skb_network_offset() returns a negative offset and __skb_pull()
arg is unsigned. IOW, we skb->data gets "adjusted" by a huge value.
The negative value is returned because skb->head and skb->data distance is
more than 64k and skb->network_header (u16) has wrapped around.
The bug is in the ip_tunnel infrastructure, which can cause
dev->needed_headroom to increment ad infinitum.
The syzkaller reproducer consists of packets getting routed via a gre
tunnel, and route of gre encapsulated packets pointing at another (ipip)
tunnel. The ipip encapsulation finds gre0 as next output device.
This results in the following pattern:
1). First packet is to be sent out via gre0.
Route lookup found an output device, ipip0.
2).
ip_tunnel_xmit for gre0 bumps gre0->needed_headroom based on the future
output device, rt.dev->needed_headroom (ipip0).
3).
ip output / start_xmit moves skb on to ipip0. which runs the same
code path again (xmit recursion).
4).
Routing step for the post-gre0-encap packet finds gre0 as output device
to use for ipip0 encapsulated packet.
tunl0->needed_headroom is then incremented based on the (already bumped)
gre0 device headroom.
This repeats for every future packet:
gre0->needed_headroom gets inflated because previous packets' ipip0 step
incremented rt->dev (gre0) headroom, and ipip0 incremented because gre0
needed_headroom was increased.
For each subsequent packet, gre/ipip0->needed_headroom grows until
post-expand-head reallocations result in a skb->head/data distance of
more than 64k.
Once that happens, skb->network_header (u16) wraps around when
pskb_expand_head tries to make sure that skb_network_offset() is unchanged
after the headroom expansion/reallocation.
After this skb_network_offset(skb) returns a different (and negative)
result post headroom expansion.
The next trip to neigh layer (or anything else that would __skb_pull the
network header) makes skb->data point to a memory location outside
skb->head area.
v2: Cap the needed_headroom update to an arbitarily chosen upperlimit to
prevent perpetual increase instead of dropping the headroom increment
completely.
Reported-and-tested-by: syzbot+bfde3bef047a81b8fde6@syzkaller.appspotmail.com
Closes: https://groups.google.com/g/syzkaller-bugs/c/fL9G6GtWskY/m/VKk_PR5FBAAJ
Fixes: 243aad830e ("ip_gre: include route header_len in max_headroom calculation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240220135606.4939-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmXV2OYNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AHCRD/9sHoOd4QCVVgcDr3SjpaVWikM0Zdkge65At/uY
bFENWgcDsSfsH7kAQm+nwzseT+QtTk9OOv9wqWzdEYROD7sqjVK2Zv/CUs24odGj
7Wj35OLYLgUIEMlHF/G9kOuWqW61URXwXcHvoFWkew1WweAVDqi648osLWUP9qkL
IFJ5729/1upq9XJc+pMxIy2Oe2zhMc4XNHsy1OCOg4fUQtDM81jgoJz0137ohCIh
PW4aaSno8ZeRuFe1RKfya5+suv3WgMui/fOBmpnnhjWVxHRJvYZ926wsy/jC7xRJ
E7/TdmymbzijRBEHh+IxQYZkE55XXc0E1Lj1ic653AzUWJ3tQRfD+HWg+GYj/WCu
sWy1e7eRJIjYVbeB5m6ao3g47Zq1XIRXo7E2Rvt3E2beM6t9aMIMuuajBHAOEV2O
pCfG4zBlEYw1SuuuoqzcXTVLKDf6WZjx1xtUAJCTks8JFTjPEwPwOQhGCv1cc/BC
qox7MejeDH/L+ZreeTYnWlQr1GGokNgrmpdDx0G8GBBRUDPoP8D4GTxvNEz44XOO
SfL2yl5v82GBBmsFHzC2J8BGN8KC4JyzDGupU+bcdMWCs8tSvMK0KVeankRvpdBl
x4VLmdoNo6zvtOYlPOxdphhsd6xA0dFiLMgSr9f5WsIgepaC+Umxp59IfCEH/bfl
1Kcg9g==
=GYgG
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
1. Prefer KMEM_CACHE() macro to create kmem caches, from Kunwu Chan.
Patches 2 and 3 consolidate nf_log NULL checks and introduces
extra boundary checks on family and type to make it clear that no out
of bounds access will happen. No in-tree user currently passes such
values, but thats not clear from looking at the function.
From Pablo Neira Ayuso.
Patch 4, also from Pablo, gets rid of unneeded conditional in
nft_osf init function.
Patch 5, from myself, fixes erroneous Kconfig dependencies that
came in an earlier net-next pull request. This should get rid
of the xtables related build failure reports.
Patches 6 to 10 are an update to nftables' concatenated-ranges
set type to speed up element insertions. This series also
compacts a few data structures and cleans up a few oddities such
as reliance on ZERO_SIZE_PTR when asking to allocate a set with
no elements. From myself.
Patches 11 moves the nf_reinject function from the netfilter core
(vmlinux) into the nfnetlink_queue backend, the only location where
this is called from. Also from myself.
Patch 12, from Kees Cook, switches xtables' compat layer to use
unsafe_memcpy because xt_entry_target cannot easily get converted
to a real flexible array (its UAPI and used inside other structs).
* tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: x_tables: Use unsafe_memcpy() for 0-sized destination
netfilter: move nf_reinject into nfnetlink_queue modules
netfilter: nft_set_pipapo: use GFP_KERNEL for insertions
netfilter: nft_set_pipapo: speed up bulk element insertions
netfilter: nft_set_pipapo: shrink data structures
netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR
netfilter: nft_set_pipapo: constify lookup fn args where possible
netfilter: xtables: fix up kconfig dependencies
netfilter: nft_osf: simplify init path
netfilter: nf_log: validate nf_logger_find_get()
netfilter: nf_log: consolidate check for NULL logger in lookup function
netfilter: expect: Simplify the allocation of slab caches in nf_conntrack_expect_init
====================
Link: https://lore.kernel.org/r/20240221112637.5396-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to re-organize the struct sock layout. The sk_peek_off
field location is problematic, as most protocols want it in the
RX read area, while UDP wants it on a cacheline different from
sk_receive_queue.
Create a local (inside udp_sock) copy of the 'peek offset is enabled'
flag and place it inside the same cacheline of reader_queue.
Check such flag before reading sk_peek_off. This will save potential
false sharing and cache misses in the fast-path.
Tested under UDP flood with small packets. The struct sock layout
update causes a 4% performance drop, and this patch restores completely
the original tput.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'ip_dst_cache' to 'rtable'.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'ip_mrt_cache' to 'mfc_cache'.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap reports arptables build failure:
arp_tables.c:(.text+0x20): undefined reference to `xt_find_table'
... because recent change removed a 'select' on the xtables core.
Add a "depends" clause on arptables to resolve this.
Kernel test robot reports another build breakage:
iptable_nat.c:(.text+0x8): undefined reference to `ipt_unregister_table_exit'
... because of a typo, the nat table selected ip6tables.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/netfilter-devel/d0dfbaef-046a-4c42-9daa-53636664bf6d@infradead.org/
Fixes: a9525c7f62 ("netfilter: xtables: allow xtables-nft only builds")
Fixes: 4654467dc7 ("netfilter: arptables: allow xtables-nft only builds")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Florian Westphal <fw@strlen.de>
The variable len being initialized with a value that is never read, an
if statement is initializing it in both paths of the if statement.
The initialization is redundant and can be removed.
Cleans up clang scan build warning:
net/ipv4/tcp_ao.c:512:11: warning: Value stored to 'len' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240216125443.2107244-1-colin.i.king@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net->dev_base_seq and ipv4.dev_addr_genid are monotonically increasing.
If we XOR their values, we could miss to detect if both values
were changed with the same amount.
Fixes: 0465277f6b ("ipv4: provide addr and netconf dump consistency info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When SOF_TIMESTAMPING_OPT_ID is used to ambiguate timestamped datagrams,
the sk_tskey can become unpredictable in case of any error happened
during sendmsg(). Move increment later in the code and make decrement of
sk_tskey in error path. This solution is still racy in case of multiple
threads doing snedmsg() over the very same socket in parallel, but still
makes error path much more predictable.
Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240213110428.1681540-1-vadfed@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
t->parms.link is read locklessly, annotate these reads
and opposite writes accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set scope automatically in ip_route_output_ports() (using the socket
SOCK_LOCALROUTE flag). This way, callers don't have to overload the
tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does.
For callers that don't pass a struct sock, this doesn't change anything
as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL.
Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or
RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use
ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with
the RTO_ONLINK flag now becomes unnecessary.
In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos
parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk
parameter is a kernel socket, which doesn't have any configuration path
for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports()
will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c
doesn't need to be modified.
Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as
these macros are now unused.
The objective is to eventually remove RTO_ONLINK entirely to allow
converting ->flowi4_tos to dscp_t. This will ensure proper isolation
between the DSCP and ECN bits, thus minimising the risk of introducing
bugs where TOS values interfere with ECN.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tnode_free() should use synchronize_net()
instead of syncronize_rcu() to release RTNL sooner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->scaling_ratio is a read mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb ("tcp: reorganize tcp_sock fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Wei Wang <weiwan@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the IPv4 modules.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240208164244.3818498-7-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/stmicro/stmmac/common.h
38cc3c6dcc ("net: stmmac: protect updates of 64-bit statistics counters")
fd5a6a7131 ("net: stmmac: est: Per Tx-queue error count for HLBF")
c5c3e1bfc9 ("net: stmmac: Offload queueMaxSDU from tc-taprio")
drivers/net/wireless/microchip/wilc1000/netdev.c
c901388028 ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000")
328efda22a ("wifi: wilc1000: do not realloc workqueue everytime an interface is added")
net/unix/garbage.c
11498715f2 ("af_unix: Remove io_uring code for GC.")
1279f9d9de ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
exit_batch_rtnl() is called while RTNL is held,
and devices to be unregistered can be queued in the dev_kill_list.
This saves one rtnl_lock()/rtnl_unlock() pair
and one unregister_netdevice_many() call.
This patch takes care of ipip, ip_vti, and ip_gre tunnels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20240206144313.2050392-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unregister_nexthop_notifier() assumes the caller does not hold rtnl.
We need in the following patch to use it from a context
already holding rtnl.
Add __unregister_nexthop_notifier().
unregister_nexthop_notifier() becomes a wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20240206144313.2050392-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
exit_batch_rtnl() is called while RTNL is held.
This saves one rtnl_lock()/rtnl_unlock() pair.
We also need to create nexthop_net_exit()
to make sure net->nexthop.devhash is not freed too soon,
otherwise we will not be able to unregister netdev
from exit_batch_rtnl() methods.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20240206144313.2050392-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_recv_error() is called without holding the socket lock.
IPv6 socket could mutate to IPv4 with IPV6_ADDRFORM
socket option and trigger a KCSAN warning.
Fixes: f4713a3dfa ("net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the ICMPv6 error is built from a non-linear skb we get the following
splat,
BUG: KASAN: slab-out-of-bounds in do_csum+0x220/0x240
Read of size 4 at addr ffff88811d402c80 by task netperf/820
CPU: 0 PID: 820 Comm: netperf Not tainted 6.8.0-rc1+ #543
...
kasan_report+0xd8/0x110
do_csum+0x220/0x240
csum_partial+0xc/0x20
skb_tunnel_check_pmtu+0xeb9/0x3280
vxlan_xmit_one+0x14c2/0x4080
vxlan_xmit+0xf61/0x5c00
dev_hard_start_xmit+0xfb/0x510
__dev_queue_xmit+0x7cd/0x32a0
br_dev_queue_push_xmit+0x39d/0x6a0
Use skb_checksum instead of csum_partial who cannot deal with non-linear
SKBs.
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:
1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
generate appropriate prototypes for functions with "bpf_kfunc" tag
To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
commit 0a31bd5f2b ("KMEM_CACHE(): simplify slab cache creation")
introduces a new macro.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240130092255.73078-1-chentao@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When inetdev_valid_mtu fails, cork->opt should be freed if it is
allocated in ip_setup_cork. Otherwise there could be a memleak.
Fixes: 501a90c945 ("inet: protect against too small mtu values.")
Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240129091017.2938835-1-alexious@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmW3ugcNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AP+IEADdlinxL+a5Rqx0W3I0gR4LiOrnHdl2SQesCjEE
iBm8Fgx7pQh6jQpjsEl+dg85CFbqI4iVxgLV/uAVCOvRFELH5aR/WHjAdoXQjrTS
55bexDCG9q9KBYCm721h2mSUTdmmx+aKfndFYMhEULzQPfDy+cS2lIh4epQPnlFH
Idc1zXuMNWM/QY0vvwkAxsZ6TMG61GIYDAH4PtEtfCUVksdkLRPG8qWs5tJJgKFp
SIyqKSB3Ab4LqY9e/HG0FwcrMwrSmNhcbO4CwpDfIrHEuIUtMKCqOp6X4lU1ekeb
xVTuQ7fU64KmO+a/sS4QH8rPfDgT31GnxaVfeL7AM9pQsiLhJGMTlfFqgItJjZrS
uch7Jtx0iWMDfuP7OgIYnS46FYD2wXShuz4wIbHI8RSEkln7GBJ2KGpnvyoF07Tf
V6ZrGQk0TnAr7MAEXHe8rd0WEVvbZuBiVHo1xpSxKI9rGJYDdgSRz16wMdBowhIW
Q++nacicTs8ak64vlAsigI4bnDYTNXsHQO2S84tXTikaq88m1/f9EqIVr/V2uMoR
xTQcAaob2TqaGirS/bx/9twEuiwB/gg/nbqmVHni285SO2JbdNQ/iglopc/+EMYS
ES3wibdQzfPL9h61KyHMGUbZke3w72Gn5X5Fp3lnoi7+ZSLMMRTBoMFv4T+DLzqJ
dyouYw==
=iDKQ
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
nf-next pr 2024-01-29
This batch contains updates for your *next* tree.
First three changes, from Phil Sutter, allow userspace to define
a table that is exclusively owned by a daemon (via netlink socket
aliveness) without auto-removing this table when the userspace program
exits. Such table gets marked as orphaned and a restarting management
daemon may re-attach/reassume ownership.
Next patch, from Pablo, passes already-validated flags variable around
rather than having called code re-fetch it from netlnik message.
Patches 5 and 6 update ipvs and nf_conncount to use the recently
introduced KMEM_CACHE() macro.
Last three patches, from myself, tweak kconfig logic a little to
permit a kernel configuration that can run iptables-over-nftables
but not classic (setsockopt) iptables.
Such builds lack the builtin-filter/mangle/raw/nat/security tables,
the set/getsockopt interface and the "old blob format"
interpreter/traverser. For now, this is 'oldconfig friendly', users
need to manually deselect existing config options for this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For now, the packet with an old ack is not accepted if we are in
FIN_WAIT1 state, which can cause retransmission. Taking the following
case as an example:
Client Server
| |
FIN_WAIT1(Send FIN, seq=10) FIN_WAIT1(Send FIN, seq=20, ack=10)
| |
| Send ACK(seq=21, ack=11)
Recv ACK(seq=21, ack=11)
|
Recv FIN(seq=20, ack=10)
In the case above, simultaneous close is happening, and the FIN and ACK
packet that send from the server is out of order. Then, the FIN will be
dropped by the client, as it has an old ack. Then, the server has to
retransmit the FIN, which can cause delay if the server has set the
SO_LINGER on the socket.
Old ack is accepted in the ESTABLISHED and TIME_WAIT state, and I think
it should be better to keep the same logic.
In this commit, we accept old ack in FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK
states. Maybe we should limit it to FIN_WAIT1 for now?
Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240126040519.1846345-1-menglong8.dong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add hidden IP(6)_NF_IPTABLES_LEGACY symbol.
When any of the "old" builtin tables are enabled the "old" iptables
interface will be supported.
To disable the old set/getsockopt interface the existing options
for the builtin tables need to be turned off:
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_FILTER is not set
CONFIG_IP_NF_NAT is not set
CONFIG_IP_NF_MANGLE is not set
CONFIG_IP_NF_RAW is not set
CONFIG_IP_NF_SECURITY is not set
Same for CONFIG_IP6_NF_ variants.
This allows to build a kernel that only supports ip(6)tables-nft
(iptables-over-nftables api).
In the future the _LEGACY symbol will become visible and the select
statements will be turned into 'depends on', but for now be on safe side
so "make oldconfig" won't break things.
Signed-off-by: Florian Westphal <fw@strlen.de>
Allows to build kernel that supports the arptables mangle target
via nftables' compat infra but without the arptables get/setsockopt
interface or the old arptables filter interpreter.
IOW, setting IP_NF_ARPFILTER=n will break arptables-legacy, but
arptables-nft will continue to work as long as nftables compat
support is enabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Phil Sutter <phil@nwl.cc>
TCP rx zerocopy intent is to map pages initially allocated
from NIC drivers, not pages owned by a fs.
This patch adds to can_map_frag() these additional checks:
- Page must not be a compound one.
- page->mapping must be NULL.
This fixes the panic reported by ZhangPeng.
syzbot was able to loopback packets built with sendfile(),
mapping pages owned by an ext4 file to TCP rx zerocopy.
r3 = socket$inet_tcp(0x2, 0x1, 0x0)
mmap(&(0x7f0000ff9000/0x4000)=nil, 0x4000, 0x0, 0x12, r3, 0x0)
r4 = socket$inet_tcp(0x2, 0x1, 0x0)
bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e24, @multicast1}, 0x10)
connect$inet(r4, &(0x7f00000006c0)={0x2, 0x4e24, @empty}, 0x10)
r5 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
0x181e42, 0x0)
fallocate(r5, 0x0, 0x0, 0x85b8)
sendfile(r4, r5, 0x0, 0x8ba0)
getsockopt$inet_tcp_TCP_ZEROCOPY_RECEIVE(r4, 0x6, 0x23,
&(0x7f00000001c0)={&(0x7f0000ffb000/0x3000)=nil, 0x3000, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0}, &(0x7f0000000440)=0x40)
r6 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
0x181e42, 0x0)
Fixes: 93ab6cc691 ("tcp: implement mmap() for zero copy receive")
Link: https://lore.kernel.org/netdev/5106a58e-04da-372a-b836-9d3d0bd2507b@huawei.com/T/
Reported-and-bisected-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI
g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9
olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc=
=wVMl
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-01-26
We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).
The main changes are:
1) Add BPF token support to delegate a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application. With addressed changes from Christian
and Linus' reviews, from Andrii Nakryiko.
2) Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops type,
from Kui-Feng Lee.
3) Add support for retrieval of cookies for perf/kprobe multi links,
from Jiri Olsa.
4) Bigger batch of prep-work for the BPF verifier to eventually support
preserving boundaries and tracking scalars on narrowing fills,
from Maxim Mikityanskiy.
5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
with the scenario of SYN floods, from Kuniyuki Iwashima.
6) Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF objects,
from Hou Tao.
7) Extend BPF verifier to track aligned ST stores as imprecise spilled
registers, from Yonghong Song.
8) Several fixes to BPF selftests around inline asm constraints and
unsupported VLA code generation, from Jose E. Marchesi.
9) Various updates to the BPF IETF instruction set draft document such
as the introduction of conformance groups for instructions,
from Dave Thaler.
10) Fix BPF verifier to make infinite loop detection in is_state_visited()
exact to catch some too lax spill/fill corner cases,
from Eduard Zingerman.
11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
instead of implicitly for various register types, from Hao Sun.
12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
in neighbor advertisement at setup time, from Martin KaFai Lau.
13) Change BPF selftests to skip callback tests for the case when the
JIT is disabled, from Tiezhu Yang.
14) Add a small extension to libbpf which allows to auto create
a map-in-map's inner map, from Andrey Grafin.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
selftests/bpf: Add missing line break in test_verifier
bpf, docs: Clarify definitions of various instructions
bpf: Fix error checks against bpf_get_btf_vmlinux().
bpf: One more maintainer for libbpf and BPF selftests
selftests/bpf: Incorporate LSM policy to token-based tests
selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
selftests/bpf: Add tests for BPF object load with implicit token
selftests/bpf: Add BPF object loading tests with explicit token passing
libbpf: Wire up BPF token support at BPF object level
libbpf: Wire up token_fd into feature probing logic
libbpf: Move feature detection code into its own file
libbpf: Further decouple feature checking logic from bpf_object
libbpf: Split feature detectors definitions from cached results
selftests/bpf: Utilize string values for delegate_xxx mount options
bpf: Support symbolic BPF FS delegation mount options
bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
bpf,selinux: Allocate bpf_security_struct per BPF token
selftests/bpf: Add BPF token-enabled tests
libbpf: Add BPF token support to bpf_prog_load() API
...
====================
Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.
Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move some of members of bpf_struct_ops to bpf_struct_ops_desc. type_id is
unavailabe in bpf_struct_ops anymore. Modules should get it from the btf
received by kmod's init function.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
We will support arbitrary SYN Cookie with BPF in the following
patch.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to cookie_[46]_check() as skb->sk. If skb->sk is not
NULL, we call cookie_bpf_check().
Then, we clear skb->sk and skb->destructor, which are needed not
to hold refcnt for reqsk and the listener. See the following patch
for details.
After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().
Note that the server side WScale is set only for non-BPF SYN Cookie.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240115205514.68364-5-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We will support arbitrary SYN Cookie with BPF.
When BPF prog validates ACK and kfunc allocates a reqsk, we need
to call tcp_ns_to_ts() to calculate an offset of TSval for later
use:
time
t0 : Send SYN+ACK
-> tsval = Initial TSval (Random Number)
t1 : Recv ACK of 3WHS
-> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
= Initial TSval - t1
t2 : Send ACK
-> tsval = t2 + tsoff
= Initial TSval + (t2 - t1)
= Initial TSval + Time Delta (x)
(x) Note that the time delta does not include the initial RTT
from t0 to t1.
Let's move tcp_ns_to_ts() to tcp.h.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240115205514.68364-2-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After the removal of inet_diag_table_mutex, sock_diag_table_mutex
and sock_diag_mutex, I was able so see spinlock contention from
inet_diag_dump_icsk() when running 100 parallel invocations.
It is time to skip over empty buckets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag are serialized
on sock_diag_table_mutex.
This is to make sure inet_diag module is not unloaded
while diag was ongoing.
It is time to get rid of this mutex and use RCU protection,
allowing full parallelism.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Following patch is going to use RCU instead of
sock_diag_table_mutex acquisition.
This patch is a preparation, no change of behavior yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
inet_diag_lock_handler() current implementation uses a mutex
to protect inet_diag_table[] array against concurrent changes.
This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.
It is time to switch to full RCU protection.
As a bonus, if a target is statically linked instead of being
modular, inet_diag_lock_handler() & inet_diag_unlock_handler()
reduce to reads only.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Following patch is going to use RCU instead of
inet_diag_table_mutex acquisition.
This patch is a preparation, no change of behavior yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>