As Jakub noticed, prints should be avoided on the datapath.
Also, as packets would never come to the else branch in
ping_lookup(), remove pr_err() from ping_lookup().
Fixes: 35a79e64de ("ping: fix the dif and sdif check in ping_lookup")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1ef3f2fcd31bd681a193b1fcf235eee1603819bd.1645674068.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This test is checking if we exited the list via break or not. However
if it did not exit via a break then "node" does not point to a valid
udp_tunnel_nic_shared_node struct. It will work because of the way
the structs are laid out it's the equivalent of
"if (info->shared->udp_tunnel_nic_info != dev)" which will always be
true, but it's not the right way to test.
Fixes: 74cc6d182d ("udp_tunnel: add the ability to share port tables")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case user space sends a packet destined to a broadcast address when a
matching broadcast route is not configured, the kernel will create a
unicast neighbour entry that will never be resolved [1].
When the broadcast route is configured, the unicast neighbour entry will
not be invalidated and continue to linger, resulting in packets being
dropped.
Solve this by invalidating unresolved neighbour entries for broadcast
addresses after routes for these addresses are internally configured by
the kernel. This allows the kernel to create a broadcast neighbour entry
following the next route lookup.
Another possible solution that is more generic but also more complex is
to have the ARP code register a listener to the FIB notification chain
and invalidate matching neighbour entries upon the addition of broadcast
routes.
It is also possible to wave off the issue as a user space problem, but
it seems a bit excessive to expect user space to be that intimately
familiar with the inner workings of the FIB/neighbour kernel code.
[1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/
Reported-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encounter a tcp drop issue in our cloud environment. Packet GROed in
host forwards to a VM virtio_net nic with net_failover enabled. VM acts
as a IPVS LB with ipip encapsulation. The full path like:
host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
-> ipip encap -> net_failover tx -> virtio_net tx
When net_failover transmits a ipip pkt (gso_type = 0x0103, which means
SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso
did because it supports TSO and GSO_IPXIP4. But network_header points to
inner ip header.
Call Trace:
tcp4_gso_segment ------> return NULL
inet_gso_segment ------> inner iph, network_header points to
ipip_gso_segment
inet_gso_segment ------> outer iph
skb_mac_gso_segment
Afterwards virtio_net transmits the pkt, only inner ip header is modified.
And the outer one just keeps unchanged. The pkt will be dropped in remote
host.
Call Trace:
inet_gso_segment ------> inner iph, outer iph is skipped
skb_mac_gso_segment
__skb_gso_segment
validate_xmit_skb
validate_xmit_skb_list
sch_direct_xmit
__qdisc_run
__dev_queue_xmit ------> virtio_net
dev_hard_start_xmit
__dev_queue_xmit ------> net_failover
ip_finish_output2
ip_output
iptunnel_xmit
ip_tunnel_xmit
ipip_tunnel_xmit ------> ipip
dev_hard_start_xmit
__dev_queue_xmit
ip_finish_output2
ip_output
ip_forward
ip_rcv
__netif_receive_skb_one_core
netif_receive_skb_internal
napi_gro_receive
receive_buf
virtnet_poll
net_rx_action
The root cause of this issue is specific with the rare combination of
SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
SKB_GSO_DODGY is set from external virtio_net. We need to reset network
header when callbacks.gso_segment() returns NULL.
This patch also includes ipv6_gso_segment(), considering SIT, etc.
Fixes: cb32f511a7 ("ipip: add GSO/TSO support")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_data_queue_ofo with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_OFOMERGE
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_data_queue() with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_ZEROWINDOW
SKB_DROP_REASON_TCP_OLD_DATA
SKB_DROP_REASON_TCP_OVERWINDOW
SKB_DROP_REASON_TCP_OLD_DATA is used for the case that end_seq of skb
less than the left edges of receive window. (Maybe there is a better
name?)
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_rcv_established() with tcp_drop_reason().
Following drop reasons are added:
SKB_DROP_REASON_TCP_FLAGS
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with
kfree_skb_reason().
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:
SKB_DROP_REASON_SOCKET_BACKLOG
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.
Following drop reasons are added:
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kfree_skb_reason() for some path in tcp_v4_rcv() that missed before,
including:
SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_XFRM_POLICY
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TCP protocol, tcp_drop() is used to free the skb when it needs
to be dropped. To make use of kfree_skb_reason() and pass the drop
reason to it, introduce the function tcp_drop_reason(). Meanwhile,
make tcp_drop() an inline call to tcp_drop_reason().
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new protocol attribute to IPv4 and IPv6 addresses.
Inspiration was taken from the protocol attribute of routes. User space
applications like iproute2 can set/get the protocol with the Netlink API.
The attribute is stored as an 8-bit unsigned integer.
The protocol attribute is set by kernel for these categories:
- IPv4 and IPv6 loopback addresses
- IPv6 addresses generated from router announcements
- IPv6 link local addresses
User space may pass custom protocols, not defined by the kernel.
Grouping addresses on their origin is useful in scenarios where you want
to distinguish between addresses based on who added them, e.g. kernel
vs. user space.
Tagging addresses with a string label is an existing feature that could be
used as a solution. Unfortunately the max length of a label is
15 characters, and for compatibility reasons the label must be prefixed
with the name of the device followed by a colon. Since device names also
have a max length of 15 characters, only -1 characters is guaranteed to be
available for any origin tag, which is not that much.
A reference implementation of user space setting and getting protocols
is available for iproute2:
9a6ea18bd7
Signed-off-by: Jacques de Laval <Jacques.De.Laval@westermo.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220217150202.80802-1-Jacques.De.Laval@westermo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP sendmsg() can be lockless, this is causing all kinds
of data races.
This patch converts sk->sk_tskey to remove one of these races.
BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
__ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
__ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000054d -> 0x0000054e
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_alias_hw_flags_set() can be used by concurrent threads,
and is only RCU protected.
We need to annotate accesses to following fields of struct fib_alias:
offload, trap, offload_failed
Because of READ_ONCE()WRITE_ONCE() limitations, make these
field u8.
BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set
read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
process_scheduled_works kernel/workqueue.c:2370 [inline]
worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
kthread+0x1bf/0x1e0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30
write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
process_scheduled_works kernel/workqueue.c:2370 [inline]
worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
kthread+0x1bf/0x1e0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30
value changed: 0x00 -> 0x02
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e82623-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work
Fixes: 90b93f1b31 ("ipv4: Add "offload" and "trap" indications to routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
There is another regression caused when matching sk_bound_dev_if
and dif, RAW socket is using inet_iif() while PING socket lookup
is using skb->dev->ifindex, the cmd below fails due to this:
# ip link add dummy0 type dummy
# ip link set dummy0 up
# ip addr add 192.168.111.1/24 dev dummy0
# ping -I dummy0 192.168.111.1 -c1
The issue was also reported on:
https://github.com/iputils/iputils/issues/104
But fixed in iputils in a wrong way by not binding to device when
destination IP is on device, and it will cause some of kselftests
to fail, as Jianlin noticed.
This patch is to use inet(6)_iif and inet(6)_sdif to get dif and
sdif for PING socket, and keep consistent with RAW socket.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple containers are running in the environment and multiple
macvlan network port are configured in each container, a lot of martian
source prints will appear after martian_log is enabled. they are almost
the same, and printed by net_warn_ratelimited. Each arp message will
trigger this print on each network port.
Such as:
IPv4: martian source 173.254.95.16 from 173.254.100.109,
on dev eth0
ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d
08 06 ......@...dm..
IPv4: martian source 173.254.95.16 from 173.254.100.109,
on dev eth1
ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d
08 06 ......@...dm..
There is no description of this kind of source in the RFC1812.
Signed-off-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an optimization to keep the per-cpu lists as short as possible:
Whenever rt_flush_dev() changes one rtable dst.dev
matching the disappearing device, it can can transfer the object
to a quarantine list, waiting for a final rt_del_uncached_list().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch intends to provide a mechanism to put constraint on SMC
connections visit according to the pressure of SMC handshake process.
At present, frequent visits will cause the incoming connections to be
backlogged in SMC handshake queue, raise the connections established
time. Which is quite unacceptable for those applications who base on
short lived connections.
There are two ways to implement this mechanism:
1. Put limitation after TCP established.
2. Put limitation before TCP established.
In the first way, we need to wait and receive CLC messages that the
client will potentially send, and then actively reply with a decline
message, in a sense, which is also a sort of SMC handshake, affect the
connections established time on its way.
In the second way, the only problem is that we need to inject SMC logic
into TCP when it is about to reply the incoming SYN, since we already do
that, it's seems not a problem anymore. And advantage is obvious, few
additional processes are required to complete the constraint.
This patch use the second way. After this patch, connections who beyond
constraint will not informed any SMC indication, and SMC will not be
involved in any of its subsequent processes.
Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 563f8e97e0 ("ipv4: Stop taking ECN bits into account in
fib4-rules") replaced the validation test on frh->tos. While the new
test is stricter for ECN bits, it doesn't detect the use of high order
DSCP bits. This would be fine if IPv4 could properly handle them. But
currently, most IPv4 lookups are done with the three high DSCP bits
masked. Therefore, using these bits doesn't lead to the expected
result.
Let's reject such configurations again, so that nobody starts to
use and make any assumption about how the stack handles the three high
order DSCP bits in fib4 rules.
Fixes: 563f8e97e0 ("ipv4: Stop taking ECN bits into account in fib4-rules")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Conntrack sets on CHECKSUM_UNNECESSARY for UDP packet with no checksum,
from Kevin Mitchell.
2) skb->priority support for nfqueue, from Nicolas Dichtel.
3) Remove conntrack extension register API, from Florian Westphal.
4) Move nat destroy hook to nf_nat_hook instead, to remove
nf_ct_ext_destroy(), also from Florian.
5) Wrap pptp conntrack NAT hooks into single structure, from Florian Westphal.
6) Support for tcp option set to noop for nf_tables, also from Florian.
7) Do not run x_tables comment match from packet path in nf_tables,
from Florian Westphal.
8) Replace spinlock by cmpxchg() loop to update missed ct event,
from Florian Westphal.
9) Wrap cttimeout hooks into single structure, from Florian.
10) Add fast nft_cmp expression for up to 16-bytes.
11) Use cb->ctx to store context in ctnetlink dump, instead of using
cb->args[], from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: ctnetlink: use dump structure instead of raw args
nfqueue: enable to set skb->priority
netfilter: nft_cmp: optimize comparison for 16-bytes
netfilter: cttimeout: use option structure
netfilter: ecache: don't use nf_conn spinlock
netfilter: nft_compat: suppress comment match
netfilter: exthdr: add support for tcp option removal
netfilter: conntrack: pptp: use single option structure
netfilter: conntrack: remove extension register api
netfilter: conntrack: handle ->destroy hook via nat_ops instead
netfilter: conntrack: move extension sizes into core
netfilter: conntrack: make all extensions 8-byte alignned
netfilter: nfqueue: enable to get skb->priority
netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY
====================
Link: https://lore.kernel.org/r/20220209133616.165104-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit
9652dc2eb9 ("tcp: relax listening_hash operations")
removed the need to disable bottom half while acquiring
listening_hash.lock. There are still two callers left which disable
bottom half before the lock is acquired.
On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves is also acquired with enabled bottom halves
followed by disabling bottom halves. This is the reverse locking order.
It has been observed with inet_listen_hashbucket:🔒
local_bh_disable() + spin_lock(&ilb->lock):
inet_listen()
inet_csk_listen_start()
sk->sk_prot->hash() := inet_hash()
local_bh_disable()
__inet_hash()
spin_lock(&ilb->lock);
acquire(&ilb->lock);
Reverse order: spin_lock(&ilb2->lock) + local_bh_disable():
tcp_seq_next()
listening_get_next()
spin_lock(&ilb2->lock);
acquire(&ilb2->lock);
tcp4_seq_show()
get_tcp4_sock()
sock_i_ino()
read_lock_bh(&sk->sk_callback_lock);
acquire(softirq_ctrl) // <---- whoops
acquire(&sk->sk_callback_lock)
Drop local_bh_disable() around __inet_hash() which acquires
listening_hash->lock. Split inet_unhash() and acquire the
listen_hashbucket lock without disabling bottom halves; the inet_ehash
lock with disabled bottom halves.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de
Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-02-09
We've added 126 non-merge commits during the last 16 day(s) which contain
a total of 201 files changed, 4049 insertions(+), 2215 deletions(-).
The main changes are:
1) Add custom BPF allocator for JITs that pack multiple programs into a huge
page to reduce iTLB pressure, from Song Liu.
2) Add __user tagging support in vmlinux BTF and utilize it from BPF
verifier when generating loads, from Yonghong Song.
3) Add per-socket fast path check guarding from cgroup/BPF overhead when
used by only some sockets, from Pavel Begunkov.
4) Continued libbpf deprecation work of APIs/features and removal of their
usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko
and various others.
5) Improve BPF instruction set documentation by adding byte swap
instructions and cleaning up load/store section, from Christoph Hellwig.
6) Switch BPF preload infra to light skeleton and remove libbpf dependency
from it, from Alexei Starovoitov.
7) Fix architecture-agnostic macros in libbpf for accessing syscall
arguments from BPF progs for non-x86 architectures,
from Ilya Leoshkevich.
8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be
of 16-bit field with anonymous zero padding, from Jakub Sitnicki.
9) Add new bpf_copy_from_user_task() helper to read memory from a different
task than current. Add ability to create sleepable BPF iterator progs,
from Kenny Yu.
10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and
utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski.
11) Generate temporary netns names for BPF selftests to avoid naming
collisions, from Hangbin Liu.
12) Implement bpf_core_types_are_compat() with limited recursion for
in-kernel usage, from Matteo Croce.
13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5
to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor.
14) Misc minor fixes to libbpf and selftests from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits)
selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup
bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
libbpf: Fix compilation warning due to mismatched printf format
selftests/bpf: Test BPF_KPROBE_SYSCALL macro
libbpf: Add BPF_KPROBE_SYSCALL macro
libbpf: Fix accessing the first syscall argument on s390
libbpf: Fix accessing the first syscall argument on arm64
libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL
selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390
libbpf: Fix accessing syscall arguments on riscv
libbpf: Fix riscv register names
libbpf: Fix accessing syscall arguments on powerpc
selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro
libbpf: Add PT_REGS_SYSCALL_REGS macro
selftests/bpf: Fix an endianness issue in bpf_syscall_macro test
bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
bpf: Fix leftover header->pages in sparc and powerpc code.
libbpf: Fix signedness bug in btf_dump_array_data()
selftests/bpf: Do not export subtest as standalone test
bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures
...
====================
Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is competing with other rtnl users.
Avoiding to acquire rtnl for each netns before calling
ipmr_rules_exit() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is competing with other rtnl users.
Instead of acquiring rtnl at each fib_net_exit() invocation,
add fib_net_exit_batch() so that rtnl is acquired once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is competing with other rtnl users.
nexthop_net_exit() seems a good candidate for exit_batch(),
as this gives chance for cleanup_net() to progress much faster,
holding rtnl a bit longer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.
Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new dscp_t type to replace the fc_tos field of fib_config, to
ensure IPv4 routes aren't influenced by ECN bits when configured with
non-zero rtm_tos.
Before this patch, IPv4 routes specifying an rtm_tos with some of the
ECN bits set were accepted. However they wouldn't work (never match) as
IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
lookup (although a few buggy code paths don't).
After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
set is rejected.
Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
but treated as if it were 0.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new dscp_t type to replace the tos field of struct fib4_rule,
so that fib4-rules consistently ignore ECN bits.
Before this patch, fib4-rules did accept rules with the high order ECN
bit set (but not the low order one). Also, it relied on its callers
masking the ECN bits of ->flowi4_tos to prevent those from influencing
the result. This was brittle and a few call paths still do the lookup
without masking the ECN bits first.
After this patch fib4-rules only compare the DSCP bits. ECN can't
influence the result anymore, even if the caller didn't mask these
bits. Also, fib4-rules now must have both ECN bits cleared or they will
be rejected.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:
SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:
SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:
SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot found that mixing sendpage() and sendmsg(MSG_ZEROCOPY)
calls over the same TCP socket would again trigger the
infamous warning in inet_sock_destruct()
WARN_ON(sk_forward_alloc_get(sk));
While Talal took into account a mix of regular copied data
and MSG_ZEROCOPY one in the same skb, the sendpage() path
has been forgotten.
We want the charging to happen for sendpage(), because
pages could be coming from a pipe. What is missing is the
downgrading of pure zerocopy status to make sure
sk_forward_alloc will stay synced.
Add tcp_downgrade_zcopy_pure() helper so that we can
use it from the two callers.
Fixes: 9b65b17db7 ("net: avoid double accounting for pure zerocopy skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20220203225547.665114-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of exposing the four hooks individually use a sinle hook ops
structure.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tcp_shift_skb_data() might collapse three packets into a larger one.
P_A, P_B, P_C -> P_ABC
Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
because it was enough.
In commit 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions"),
this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
But the now needed test over P_C has been missed.
This probably broke MPTCP.
Then later, commit 9b65b17db7 ("net: avoid double accounting for pure zerocopy skbs")
added an extra condition to tcp_skb_can_collapse(), but the missing call
from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
might have different skb_zcopy_pure() status.
Fixes: 8571248411 ("tcp: coalesce/collapse must respect MPTCP extensions")
Fixes: 9b65b17db7 ("net: avoid double accounting for pure zerocopy skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting RTO through BPF program, some SYN ACK packets were unaffected
and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
and is reassigned through BPF using tcp_timeout_init call. SYN ACK
retransmits now use newly added timeout option.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Acked-by: Martin KaFai Lau <kafai@fb.com>
v2:
- Add timeout option to struct request_sock. Do not call
tcp_timeout_init on every syn ack retransmit.
v3:
- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.
v4:
- Refactor duplicate code by adding reqsk_timeout function.
Signed-off-by: David S. Miller <davem@davemloft.net>
We got reports of following warning in inet_sock_destruct()
WARN_ON(sk_forward_alloc_get(sk));
Whenever we add a non zero-copy fragment to a pure zerocopy skb,
we have to anticipate that whole skb->truesize will be uncharged
when skb is finally freed.
skb->data_len is the payload length. But the memory truesize
estimated by __zerocopy_sg_from_iter() is page aligned.
Fixes: 9b65b17db7 ("net: avoid double accounting for pure zerocopy skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20220201065254.680532-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Disabling rehash behavior did not affect SYN ACK retransmits because hash
was forcefully changed bypassing the sk_rethink_hash function. This patch
adds a condition which checks for rehash mode before resetting hash.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_idents_reserve is only used in net/ipv4/route.c. Make it static
and remove the export.
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since v2.5.44 and addition of ip_options_fragment()
ip_options_build() does not render headers for fragments
directly. @is_frag is always 0.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Remove leftovers from flowtable modules, from Geert Uytterhoeven.
2) Missing refcount increment of conntrack template in nft_ct,
from Florian Westphal.
3) Reduce nft_zone selftest time, also from Florian.
4) Add selftest to cover stateless NAT on fragments, from Florian Westphal.
5) Do not set net_device when for reject packets from the bridge path,
from Phil Sutter.
6) Cancel register tracking info on nft_byteorder operations.
7) Extend nft_concat_range selftest to cover set reload with no elements,
from Florian Westphal.
8) Remove useless update of pointer in chain blob builder, reported
by kbuild test robot.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nf_tables: remove assignment with no effect in chain blob builder
selftests: nft_concat_range: add test for reload with no element add/del
netfilter: nft_byteorder: track register operations
netfilter: nft_reject_bridge: Fix for missing reply from prerouting
selftests: netfilter: check stateless nat udp checksum fixup
selftests: netfilter: reduce zone stress test running time
netfilter: nft_ct: fix use after free when attaching zone template
netfilter: Remove flowtable relics
====================
Link: https://lore.kernel.org/r/20220127235235.656931-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>