Commit Graph

61215 Commits

Author SHA1 Message Date
Karsten Graul
b7eede7578 net/smc: fix work request handling
Wait for pending sends only when smc_switch_conns() found a link to move
the connections to. Do not wait during link freeing, this can lead to
permanent hang situations. And refuse to provide a new tx slot on an
unusable link.

Fixes: c6f02ebeea ("net/smc: switch connections to alternate link")
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08 12:35:15 -07:00
Karsten Graul
6778a6bed0 net/smc: separate LLC wait queues for flow and messages
There might be races in scenarios where both SMC link groups are on the
same system. Prevent that by creating separate wait queues for LLC flows
and messages. Switch to non-interruptable versions of wait_event() and
wake_up() for the llc flow waiter to make sure the waiters get control
sequentially. Fine tune the llc_flow_lock to include the assignment of
the message. Write to system log when an unexpected message was
dropped. And remove an extra indirection and use the existing local
variable lgr in smc_llc_enqueue().

Fixes: 555da9af82 ("net/smc: add event-based llc_flow framework")
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08 12:35:15 -07:00
Christoph Hellwig
6955a76fbc bpfilter: switch to kernel_write
While pipes don't really need sb_writers projection, __kernel_write is an
interface better kept private, and the additional rw_verify_area does not
hurt here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 08:27:56 +02:00
Stanislav Fomichev
f5836749c9 bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook
Sometimes it's handy to know when the socket gets freed. In
particular, we'd like to try to use a smarter allocation of
ports for bpf_bind and explore the possibility of limiting
the number of SOCK_DGRAM sockets the process can have.

Implement BPF_CGROUP_INET_SOCK_RELEASE hook that triggers on
inet socket release. It triggers only for userspace sockets
(not in-kernel ones) and therefore has the same semantics as
the existing BPF_CGROUP_INET_SOCK_CREATE.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200706230128.4073544-2-sdf@google.com
2020-07-08 01:03:31 +02:00
Gustavo A. R. Silva
964201de69 net/sched: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:47:46 -07:00
Alexander A. Klimov
535094a0c9 Replace HTTP links with HTTPS ones: X.25 network layer
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
          If both the HTTP and HTTPS versions
          return 200 OK and serve the same content:
            Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:44:44 -07:00
Florian Fainelli
f3631ab08e net: ethtool: Remove PHYLIB direct dependency
Now that we have introduced ethtool_phy_ops and the PHY library
dynamically registers its operations with that function pointer, we can
remove the direct PHYLIB dependency in favor of using dynamic
operations.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:41:05 -07:00
Florian Fainelli
4895d7808e net: ethtool: Introduce ethtool_phy_ops
In order to decouple ethtool from its PHY library dependency, define an
ethtool_phy_ops singleton which can be overriden by the PHY library when
it loads with an appropriate set of function pointers.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:41:04 -07:00
Linus Lüssing
5fc6266af7 bridge: mcast: Fix MLD2 Report IPv6 payload length check
Commit e57f61858b ("net: bridge: mcast: fix stale nsrcs pointer in
igmp3/mld2 report handling") introduced a bug in the IPv6 header payload
length check which would potentially lead to rejecting a valid MLD2 Report:

The check needs to take into account the 2 bytes for the "Number of
Sources" field in the "Multicast Address Record" before reading it.
And not the size of a pointer to this field.

Fixes: e57f61858b ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:37:57 -07:00
wenxu
8367b3ab6e net/sched: act_ct: add miss tcf_lastuse_update.
When tcf_ct_act execute the tcf_lastuse_update should
be update or the used stats never update

filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 1.1.1.1
  ip_flags frag/firstfrag
  skip_hw
  not_in_hw
 action order 1: ct zone 1 nat pipe
  index 1 ref 1 bind 1 installed 103 sec used 103 sec
 Action statistics:
 Sent 151500 bytes 101 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 cookie 4519c04dc64a1a295787aab13b6a50fb

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:29:44 -07:00
Paolo Abeni
9c29e36152 mptcp: fix DSS map generation on fin retransmission
The RFC 8684 mandates that no-data DATA FIN packets should carry
a DSS with 0 sequence number and data len equal to 1. Currently,
on FIN retransmission we re-use the existing mapping; if the previous
fin transmission was part of a partially acked data packet, we could
end-up writing in the egress packet a non-compliant DSS.

The above will be detected by a "Bad mapping" warning on the receiver
side.

This change addresses the issue explicitly checking for 0 len packet
when adding the DATA_FIN option.

Fixes: 6d0060f600 ("mptcp: Write MPTCP DSS headers to outgoing data packets")
Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:27:37 -07:00
Sabrina Dubroca
5eff069023 ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
both raw sockets and IPv6 ping sockets set those fields.

Expected output of "ip xfrm monitor":
    acquire proto esp
      sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
      policy src 10.0.2.15/32 dst 8.8.8.8/32
        <snip>

Currently with ping sockets:
    acquire proto esp
      sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
      policy src 10.0.2.15/32 dst 8.8.8.8/32
        <snip>

The Libreswan test suite found this problem after Fedora changed the
value for the sysctl net.ipv4.ping_group_range.

Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Paul Wouters <pwouters@redhat.com>
Tested-by: Paul Wouters <pwouters@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:26:37 -07:00
Cong Wang
ad0f75e5f5 cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.

This bug seems to be introduced since the beginning, commit
d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 13:34:11 -07:00
David Ahern
aea23c323d ipv6: Fix use of anycast address with loopback
Thomas reported a regression with IPv6 and anycast using the following
reproducer:

    echo 1 >  /proc/sys/net/ipv6/conf/all/forwarding
    ip -6 a add fc12::1/16 dev lo
    sleep 2
    echo "pinging lo"
    ping6 -c 2 fc12::

The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed
the use of fib6_is_reject which checks addresses added to the loopback
interface and sets the REJECT flag as needed. Update fib6_is_reject for
loopback checks to handle RTF_ANYCAST addresses.

Fixes: c7a1ce397a ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
Reported-by: thomas.gambier@nexedi.com
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 13:04:16 -07:00
Florian Westphal
b416268b7a mptcp: use mptcp worker for path management
We can re-use the existing work queue to handle path management
instead of a dedicated work queue.  Just move pm_worker to protocol.c,
call it from the mptcp worker and get rid of the msk lock (already held).

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 13:02:13 -07:00
Eric W. Biederman
e80eb1dc86 bpfilter: Take advantage of the facilities of struct pid
Instead of relying on the exit_umh cleanup callback use the fact a
struct pid can be tested to see if a process still exists, and that
struct pid has a wait queue that notifies when the process dies.

v1: https://lkml.kernel.org/r/87h7uydlu9.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/874kqt4owu.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-14-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-07 11:58:57 -05:00
Davide Caratti
d47a721520 mptcp: fix race in subflow_data_ready()
syzkaller was able to make the kernel reach subflow_data_ready() for a
server subflow that was closed before subflow_finish_connect() completed.
In these cases we can avoid using the path for regular/fallback MPTCP
data, and just wake the main socket, to avoid the following warning:

 WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885
 subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
 Kernel panic - not syncing: panic_on_warn set ...
 CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
 rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0xb7/0xfe lib/dump_stack.c:118
  panic+0x29e/0x692 kernel/panic.c:221
  __warn.cold+0x2f/0x3d kernel/panic.c:582
  report_bug+0x28b/0x2f0 lib/bug.c:195
  fixup_bug arch/x86/kernel/traps.c:105 [inline]
  fixup_bug arch/x86/kernel/traps.c:100 [inline]
  do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197
  do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216
  invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027
 RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
 Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18
 89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9
 59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff
 RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206
 RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609
 RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001
 RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af
 R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336
 R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570
  tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841
  tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642
  tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999
  ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204
  ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
  NF_HOOK include/linux/netfilter.h:421 [inline]
  ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252
  dst_input include/net/dst.h:441 [inline]
  ip_rcv_finish net/ipv4/ip_input.c:428 [inline]
  ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
  NF_HOOK include/linux/netfilter.h:421 [inline]
  ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539
  __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268
  __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382
  process_backlog+0x1e5/0x6d0 net/core/dev.c:6226
  napi_poll net/core/dev.c:6671 [inline]
  net_rx_action+0x3e3/0xd70 net/core/dev.c:6739
  __do_softirq+0x18c/0x634 kernel/softirq.c:292
  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
  </IRQ>
  do_softirq.part.0+0x26/0x30 kernel/softirq.c:337
  do_softirq arch/x86/include/asm/preempt.h:26 [inline]
  __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189
  local_bh_enable include/linux/bottom_half.h:32 [inline]
  rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline]
  ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229
  __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306
  dst_output include/net/dst.h:435 [inline]
  ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125
  __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530
  __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238
  __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785
  __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline]
  tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791
  tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline]
  tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209
  tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651
  sk_backlog_rcv include/net/sock.h:996 [inline]
  __release_sock+0x1ad/0x310 net/core/sock.c:2548
  release_sock+0x54/0x1a0 net/core/sock.c:3064
  inet_wait_for_connect net/ipv4/af_inet.c:594 [inline]
  __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686
  inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725
  mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920
  __sys_connect_file net/socket.c:1854 [inline]
  __sys_connect+0x267/0x2f0 net/socket.c:1871
  __do_sys_connect net/socket.c:1882 [inline]
  __se_sys_connect net/socket.c:1879 [inline]
  __x64_sys_connect+0x6f/0xb0 net/socket.c:1879
  do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fb577d06469
 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89
 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
 RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
 RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469
 RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003
 RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39
Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: e1ff9e82e2 ("net: mptcp: improve fallback to TCP")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06 13:31:12 -07:00
David Ahern
34fe5a1cf9 ipv6: fib6_select_path can not use out path for nexthop objects
Brian reported a crash in IPv6 code when using rpfilter with a setup
running FRR and external nexthop objects. The root cause of the crash
is fib6_select_path setting fib6_nh in the result to NULL because of
an improper check for nexthop objects.

More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif
set causing fib6_select_path to be called with have_oif_match set.
fib6_select_path has early check on have_oif_match and jumps to the
out label which presumes a builtin fib6_nh. This path is invalid for
nexthop objects; for external nexthops fib6_select_path needs to just
return if the fib6_nh has already been set in the result otherwise it
returns after the call to nexthop_path_fib6_result. Update the check
on have_oif_match to not bail on external nexthops.

Update selftests for this problem.

Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Reported-by: Brian Rak <brak@choopa.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06 13:24:16 -07:00
Alexander A. Klimov
7a6498ebcd Replace HTTP links with HTTPS ones: IPv*
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
          If both the HTTP and HTTPS versions
          return 200 OK and serve the same content:
            Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06 13:23:03 -07:00
Andrew Lunn
99bac53d06 net: dsa: tag_qca.c: Fix warning for __be16 vs u16
net/dsa/tag_qca.c:48:15: warning: incorrect type in assignment (different base types)
net/dsa/tag_qca.c:48:15:    expected unsigned short [usertype]
net/dsa/tag_qca.c:48:15:    got restricted __be16 [usertype]
net/dsa/tag_qca.c:68:13: warning: incorrect type in assignment (different base types)
net/dsa/tag_qca.c:68:13:    expected restricted __be16 [usertype] hdr
net/dsa/tag_qca.c:68:13:    got int
net/dsa/tag_qca.c:71:16: warning: restricted __be16 degrades to integer
net/dsa/tag_qca.c:81:17: warning: restricted __be16 degrades to integer

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05 15:31:58 -07:00
Andrew Lunn
04d63f9d91 net: dsa: tag_mtk: Fix warnings for __be16
net/dsa/tag_mtk.c:84:13: warning: incorrect type in assignment (different base types)
net/dsa/tag_mtk.c:84:13:    expected restricted __be16 [usertype] hdr
net/dsa/tag_mtk.c:84:13:    got int
net/dsa/tag_mtk.c:94:17: warning: restricted __be16 degrades to integer

The result of a ntohs() is not __be16, but u16.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05 15:31:58 -07:00
Andrew Lunn
802734ad77 net: dsa: tag_lan9303: Fix __be16 warnings
net/dsa/tag_lan9303.c:76:24: warning: incorrect type in assignment (different base types)
net/dsa/tag_lan9303.c:76:24:    expected unsigned short [usertype]
net/dsa/tag_lan9303.c:76:24:    got restricted __be16 [usertype]
net/dsa/tag_lan9303.c:80:24: warning: incorrect type in assignment (different base types)
net/dsa/tag_lan9303.c:80:24:    expected unsigned short [usertype]
net/dsa/tag_lan9303.c:80:24:    got restricted __be16 [usertype]
net/dsa/tag_lan9303.c:106:31: warning: restricted __be16 degrades to integer
net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16
net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16
net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16
net/dsa/tag_lan9303.c:111:24: warning: cast to restricted __be16

Make use of __be16 where appropriate to fix these warnings.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05 15:31:58 -07:00
Andrew Lunn
ed6444ea03 net: dsa: tag_ksz: Fix __be16 warnings
cpu_to_be16 returns a __be16 value. So what it is assigned to needs to
have the same type to avoid warnings.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05 15:31:58 -07:00
Andrew Lunn
a61bf20831 net: dsa: Add __percpu property to prevent warnings
net/dsa/slave.c:505:13: warning: incorrect type in initializer (different address spaces)
net/dsa/slave.c:505:13:    expected void const [noderef] <asn:3> *__vpp_verify
net/dsa/slave.c:505:13:    got struct pcpu_sw_netstats *

Add the needed _percpu property to prevent this warning.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05 15:31:58 -07:00
Taehee Yoo
ccfc9df135 hsr: fix interface leak in error path of hsr_dev_finalize()
To release hsr(upper) interface, it should release
its own lower interfaces first.
Then, hsr(upper) interface can be released safely.
In the current code of error path of hsr_dev_finalize(), it releases hsr
interface before releasing a lower interface.
So, a warning occurs, which warns about the leak of lower interfaces.
In order to fix this problem, changing the ordering of the error path of
hsr_dev_finalize() is needed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add dummy2 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link add hsr1 type hsr slave1 dummy2 slave2 dummy0

Splat looks like:
[  214.923127][    C2] WARNING: CPU: 2 PID: 1093 at net/core/dev.c:8992 rollback_registered_many+0x986/0xcf0
[  214.923129][    C2] Modules linked in: hsr dummy openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipx
[  214.923154][    C2] CPU: 2 PID: 1093 Comm: ip Not tainted 5.8.0-rc2+ #623
[  214.923156][    C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  214.923157][    C2] RIP: 0010:rollback_registered_many+0x986/0xcf0
[  214.923160][    C2] Code: 41 8b 4e cc 45 31 c0 31 d2 4c 89 ee 48 89 df e8 e0 47 ff ff 85 c0 0f 84 cd fc ff ff 5
[  214.923162][    C2] RSP: 0018:ffff8880c5156f28 EFLAGS: 00010287
[  214.923165][    C2] RAX: ffff8880d1dad458 RBX: ffff8880bd1b9000 RCX: ffffffffb929d243
[  214.923167][    C2] RDX: 1ffffffff77e63f0 RSI: 0000000000000008 RDI: ffffffffbbf31f80
[  214.923168][    C2] RBP: dffffc0000000000 R08: fffffbfff77e63f1 R09: fffffbfff77e63f1
[  214.923170][    C2] R10: ffffffffbbf31f87 R11: 0000000000000001 R12: ffff8880c51570a0
[  214.923172][    C2] R13: ffff8880bd1b90b8 R14: ffff8880c5157048 R15: ffff8880d1dacc40
[  214.923174][    C2] FS:  00007fdd257a20c0(0000) GS:ffff8880da200000(0000) knlGS:0000000000000000
[  214.923175][    C2] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.923177][    C2] CR2: 00007ffd78beb038 CR3: 00000000be544005 CR4: 00000000000606e0
[  214.923179][    C2] Call Trace:
[  214.923180][    C2]  ? netif_set_real_num_tx_queues+0x780/0x780
[  214.923182][    C2]  ? dev_validate_mtu+0x140/0x140
[  214.923183][    C2]  ? synchronize_rcu.part.79+0x85/0xd0
[  214.923185][    C2]  ? synchronize_rcu_expedited+0xbb0/0xbb0
[  214.923187][    C2]  rollback_registered+0xc8/0x170
[  214.923188][    C2]  ? rollback_registered_many+0xcf0/0xcf0
[  214.923190][    C2]  unregister_netdevice_queue+0x18b/0x240
[  214.923191][    C2]  hsr_dev_finalize+0x56e/0x6e0 [hsr]
[  214.923192][    C2]  hsr_newlink+0x36b/0x450 [hsr]
[  214.923194][    C2]  ? hsr_dellink+0x70/0x70 [hsr]
[  214.923195][    C2]  ? rtnl_create_link+0x2e4/0xb00
[  214.923197][    C2]  ? __netlink_ns_capable+0xc3/0xf0
[  214.923198][    C2]  __rtnl_newlink+0xbdb/0x1270
[ ... ]

Fixes: e0a4b99773 ("hsr: use upper/lower device infrastructure")
Reported-by: syzbot+7f1c020f68dab95aab59@syzkaller.appspotmail.com
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 18:02:48 -07:00
Florian Westphal
c9b95a1359 mptcp: support IPV6_V6ONLY setsockopt
Without this, Opensshd fails to open an ipv6 socket listening
socket:
  error: setsockopt IPV6_V6ONLY: Operation not supported
  error: Bind to port 22 on :: failed: Address already in use.

Opensshd opens an ipv4 and and ipv6 listening socket, but because
IPV6_V6ONLY setsockopt fails, the port number is already in use.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
Florian Westphal
fd1452d8ef mptcp: add REUSEADDR/REUSEPORT support
This will e.g. make 'sshd restart' work when MPTCP is used, as we will
now set this option on the listener socket instead of only the mptcp
socket (where it has no effect).

We still need to copy the setting to the master socket so that a
subsequent getsockopt() returns the expected value.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
Florian Westphal
83f0c10bc3 net: use mptcp setsockopt function for SOL_SOCKET on mptcp sockets
setsockopt(mptcp_fd, SOL_SOCKET, ...)...  appears to work (returns 0),
but it has no effect -- this is because the MPTCP layer never has a
chance to copy the settings to the subflow socket.

Skip the generic handling for the mptcp case and instead call the
mptcp specific handler instead for SOL_SOCKET too.

Next patch adds more specific handling for SOL_SOCKET to mptcp.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
David S. Miller
f91c031e65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-07-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 17 day(s) which contain
a total of 106 files changed, 5233 insertions(+), 1283 deletions(-).

The main changes are:

1) bpftool ability to show PIDs of processes having open file descriptors
   for BPF map/program/link/BTF objects, relying on BPF iterator progs
   to extract this info efficiently, from Andrii Nakryiko.

2) Addition of BPF iterator progs for dumping TCP and UDP sockets to
   seq_files, from Yonghong Song.

3) Support access to BPF map fields in struct bpf_map from programs
   through BTF struct access, from Andrey Ignatov.

4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack
   via seq_file from BPF iterator progs, from Song Liu.

5) Make SO_KEEPALIVE and related options available to bpf_setsockopt()
   helper, from Dmitry Yakunin.

6) Optimize BPF sk_storage selection of its caching index, from Martin
   KaFai Lau.

7) Removal of redundant synchronize_rcu()s from BPF map destruction which
   has been a historic leftover, from Alexei Starovoitov.

8) Several improvements to test_progs to make it easier to create a shell
   loop that invokes each test individually which is useful for some CIs,
   from Jesper Dangaard Brouer.

9) Fix bpftool prog dump segfault when compiled without skeleton code on
   older clang versions, from John Fastabend.

10) Bunch of cleanups and minor improvements, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:48:34 -07:00
David S. Miller
c00e858d55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Use kvfree() to release vmalloc()'ed areas in ipset, from Eric Dumazet.

2) UAF in nfnetlink_queue from the nf_conntrack_update() path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:47:35 -07:00
Eric W. Biederman
1c340ead18 umd: Track user space drivers with struct pid
Use struct pid instead of user space pid values that are prone to wrap
araound.

In addition track the entire thread group instead of just the first
thread that is started by exec.  There are no multi-threaded user mode
drivers today but there is nothing preclucing user drivers from being
multi-threaded, so it is just a good idea to track the entire process.

Take a reference count on the tgid's in question to make it possible
to remove exit_umh in a future change.

As a struct pid is available directly use kill_pid_info.

The prior process signalling code was iffy in using a userspace pid
known to be in the initial pid namespace and then looking up it's task
in whatever the current pid namespace is.  It worked only because
kernel threads always run in the initial pid namespace.

As the tgid is now refcounted verify the tgid is NULL at the start of
fork_usermode_driver to avoid the possibility of silent pid leaks.

v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:56 -05:00
Eric W. Biederman
0fe3c63148 bpfilter: Move bpfilter_umh back into init data
To allow for restarts 61fbf5933d ("net: bpfilter: restart
bpfilter_umh when error occurred") moved the blob holding the
userspace binary out of the init sections.

Now that loading the blob into a filesystem is separate from executing
the blob the blob no longer needs to live .rodata to allow for restarting.
So move the blob back to .init.rodata.

v1: https://lkml.kernel.org/r/87sgeidlvq.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87ftad4ozc.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-11-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:50 -05:00
Eric W. Biederman
e2dc9bf3f5 umd: Transform fork_usermode_blob into fork_usermode_driver
Instead of loading a binary blob into a temporary file with
shmem_kernel_file_setup load a binary blob into a temporary tmpfs
filesystem.  This means that the blob can be stored in an init section
and discared, and it means the binary blob will have a filename so can
be executed normally.

The only tricky thing about this code is that in the helper function
blob_to_mnt __fput_sync is used.  That is because a file can not be
executed if it is still open for write, and the ordinary delayed close
for kernel threads does not happen soon enough, which causes the
following exec to fail.  The function umd_load_blob is not called with
any locks so this should be safe.

Executing the blob normally winds up correcting several problems with
the user mode driver code discovered by Tetsuo Handa[1].  By passing
an ordinary filename into the exec, it is no longer necessary to
figure out how to turn a O_RDWR file descriptor into a properly
referende counted O_EXEC file descriptor that forbids all writes.  For
path based LSMs there are no new special cases.

[1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
v1: https://lkml.kernel.org/r/87d05mf0j9.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87wo3p4p35.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-8-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:29 -05:00
Eric W. Biederman
1199c6c3da umd: Rename umd_info.cmdline umd_info.driver_name
The only thing supplied in the cmdline today is the driver name so
rename the field to clarify the code.

As this value is always supplied stop trying to handle the case of
a NULL cmdline.

Additionally since we now have a name we can count on use the
driver_name any place where the code is looking for a name
of the binary.

v1: https://lkml.kernel.org/r/87imfef0k3.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87366d63os.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-7-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:13 -05:00
Eric W. Biederman
74be2d3b80 umd: For clarity rename umh_info umd_info
This structure is only used for user mode drivers so change
the prefix from umh to umd to make that clear.

v1: https://lkml.kernel.org/r/87o8p6f0kw.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/878sg563po.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-6-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:39 -05:00
Pablo Neira Ayuso
c1f79a2eef netfilter: nf_tables: reject unsupported chain flags
Bail out if userspace sends unsupported chain flags.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 02:51:28 +02:00
Pablo Neira Ayuso
d0e2c7de92 netfilter: nf_tables: add NFT_CHAIN_BINDING
This new chain flag specifies that:

* the kernel dynamically allocates the chain name, if no chain name
  is specified.

* If the immediate expression that refers to this chain is removed,
  then this bound chain (and its content) is destroyed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:22:14 +02:00
Pablo Neira Ayuso
04b7db4144 netfilter: nf_tables: add nft_chain_add()
This patch adds a helper function to add the chain to the hashtable and
the chain list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:42 +02:00
Pablo Neira Ayuso
67c49de4ad netfilter: nf_tables: expose enum nft_chain_flags through UAPI
This enum definition was never exposed through UAPI. Rename
NFT_BASE_CHAIN to NFT_CHAIN_BASE for consistency.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:41 +02:00
Pablo Neira Ayuso
51d70f181f netfilter: nf_tables: add NFTA_VERDICT_CHAIN_ID attribute
This netlink attribute allows you to identify the chain to jump/goto by
means of the chain ID.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:41 +02:00
Pablo Neira Ayuso
837830a4b4 netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute
This new netlink attribute allows you to add rules to chains by the
chain ID.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:41 +02:00
Pablo Neira Ayuso
74cccc3d38 netfilter: nf_tables: add NFTA_CHAIN_ID attribute
This netlink attribute allows you to refer to chains inside a
transaction as an alternative to the name and the handle. The chain
binding support requires this new chain ID approach.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:41 +02:00
Julian Anastasov
f0a5e4d7a5 ipvs: allow connection reuse for unconfirmed conntrack
YangYuxi is reporting that connection reuse
is causing one-second delay when SYN hits
existing connection in TIME_WAIT state.
Such delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at that time but it is causing problem for
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed we can not schedule it to different
real server and the one-second delay still
applies but if new conntrack was created,
we are free to select new real server without
any delays.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

- Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

- Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

Fixes: f719e3754e ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi <yx.atom1@gmail.com>
Signed-off-by: YangYuxi <yx.atom1@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04 01:18:37 +02:00
Willem de Bruijn
cd8700e45e ipv6/ping: set skb->mark on icmpv6 sockets
IPv6 ping sockets route based on fwmark, but do not yet set skb->mark.
Add this. IPv4 ping sockets also do both.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-03 14:36:04 -07:00
Toke Høiland-Jørgensen
d7bf2ebebc sched: consistently handle layer3 header accesses in the presence of VLANs
There are a couple of places in net/sched/ that check skb->protocol and act
on the value there. However, in the presence of VLAN tags, the value stored
in skb->protocol can be inconsistent based on whether VLAN acceleration is
enabled. The commit quoted in the Fixes tag below fixed the users of
skb->protocol to use a helper that will always see the VLAN ethertype.

However, most of the callers don't actually handle the VLAN ethertype, but
expect to find the IP header type in the protocol field. This means that
things like changing the ECN field, or parsing diffserv values, stops
working if there's a VLAN tag, or if there are multiple nested VLAN
tags (QinQ).

To fix this, change the helper to take an argument that indicates whether
the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
make sure to skip all of them, so behaviour is consistent even in QinQ
mode.

To make the helper usable from the ECN code, move it to if_vlan.h instead
of pkt_sched.h.

v3:
- Remove empty lines
- Move vlan variable definitions inside loop in skb_protocol()
- Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
  bpf_skb_ecn_set_ce()

v2:
- Use eth_type_vlan() helper in skb_protocol()
- Also fix code that reads skb->protocol directly
- Change a couple of 'if/else if' statements to switch constructs to avoid
  calling the helper twice

Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
Fixes: d8b9605d26 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-03 14:34:53 -07:00
Pablo Neira Ayuso
d005fbb855 netfilter: conntrack: refetch conntrack after nf_conntrack_update()
__nf_conntrack_update() might refresh the conntrack object that is
attached to the skbuff. Otherwise, this triggers UAF.

[  633.200434] ==================================================================
[  633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769

[  633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388
[  633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
[  633.200491] Call Trace:
[  633.200499]  dump_stack+0x7c/0xb0
[  633.200526]  ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200532]  print_address_description.constprop.6+0x1a/0x200
[  633.200539]  ? _raw_write_lock_irqsave+0xc0/0xc0
[  633.200568]  ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200594]  ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200598]  kasan_report.cold.9+0x1f/0x42
[  633.200604]  ? call_rcu+0x2c0/0x390
[  633.200633]  ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200659]  nf_conntrack_update+0x34e/0x770 [nf_conntrack]
[  633.200687]  ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack]

Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436
Fixes: ee04805ff5 ("netfilter: conntrack: make conntrack userspace helpers work again")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-03 14:47:03 +02:00
Linus Torvalds
083176c86f Fixes for a umask bug on exported filesystems lacking ACL support, a
leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and
 a compile warning.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl79+IoVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+6I8P/24e9/W50SUBsQYseG2BpwjR/RsQ
 YjMbqrf1XOxo+8axNpdbe0bhq2jWiyQz0esnF33RlztlDSJmSNJfueWDSezKzKwC
 o8afQx0qJaVZUsT/XAXa2Hk2OZd2ZYF6f3DGMiz+knBGdAzSwJjpgqhzocMCQ3Hr
 t/PG6DJazLB3VDIe1VziTet2uv52A0A+uBYKguK/QPlpae2uXKFJ8U7v6wCsU395
 Sqd2/X2KGbeYoCrWsmpvdCDVeNmAbI0KlhY8pR6BHqGp7TYm4+AueqWzpYHlNHei
 PukM8AROoTBEAc6Wiqqmp0UKRR+Qn/9NIuvQtvBnC6WGIPjEG1hTkAwlRfT6VYvn
 oPg4oekKjRJLz/TSaqfJRpli5GwxfWAW14LTZZT+Xe0/7FhVe28/R8F1dP5ZJaeq
 h9//4rCt/yUYAQq1odOMbNCr0rGVcKzdSN3E36OvJFVQ9bMyXHKetKHywOki13w5
 M8UQK5zb21ghT7OSICmeRXHqsXRmTFO8QhUZ8L63Qb2hfiQ5fVQdSiHmM8iRcwWY
 bxqrSs8YV7i+I0i1YYTYWmmFgP8Y11sL7ovAEs86cP2Rk58Bk5VA2TPT114W53AD
 xaZHpjsH0AfZS87dEVdvS2/dAdtbHZsFwHxGnfvyl/CKTqoz5yY5etcwULQI3+XQ
 8kG8FOFpt7T/5zmB
 =wF3M
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Fixes for a umask bug on exported filesystems lacking ACL support, a
  leak and a module unloading bug in the /proc/fs/nfsd/clients/ code,
  and a compile warning"

* tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux:
  SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
  nfsd: fix nfsdfs inode reference count leak
  nfsd4: fix nfsdfs reference count loop
  nfsd: apply umask on fs without ACL support
2020-07-02 20:35:33 -07:00
Horatiu Vultur
36a8e8e265 bridge: Extend br_fill_ifinfo to return MPR status
This patch extends the function br_fill_ifinfo to return also the MRP
status for each instance on a bridge. It also adds a new filter
RTEXT_FILTER_MRP to return the MRP status only when this is set, not to
interfer with the vlans. The MRP status is return only on the bridge
interfaces.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02 14:19:15 -07:00
Horatiu Vultur
df42ef227d bridge: mrp: Add br_mrp_fill_info
Add the function br_mrp_fill_info which populates the MRP attributes
regarding the status of each MRP instance.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02 14:19:15 -07:00
Eric Dumazet
1ca0fafd73 tcp: md5: allow changing MD5 keys in all socket states
This essentially reverts commit 7212303268 ("tcp: md5: reject TCP_MD5SIG
or TCP_MD5SIG_EXT on established sockets")

Mathieu reported that many vendors BGP implementations can
actually switch TCP MD5 on established flows.

Quoting Mathieu :
   Here is a list of a few network vendors along with their behavior
   with respect to TCP MD5:

   - Cisco: Allows for password to be changed, but within the hold-down
     timer (~180 seconds).
   - Juniper: When password is initially set on active connection it will
     reset, but after that any subsequent password changes no network
     resets.
   - Nokia: No notes on if they flap the tcp connection or not.
   - Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
     both sides are ok with new passwords.
   - Meta-Switch: Expects the password to be set before a connection is
     attempted, but no further info on whether they reset the TCP
     connection on a change.
   - Avaya: Disable the neighbor, then set password, then re-enable.
   - Zebos: Would normally allow the change when socket connected.

We can revert my prior change because commit 9424e2e7ad ("tcp: md5: fix potential
overestimation of TCP option space") removed the leak of 4 kernel bytes to
the wire that was the main reason for my patch.

While doing my investigations, I found a bug when a MD5 key is changed, leading
to these commits that stable teams want to consider before backporting this revert :

 Commit 6a2febec33 ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
 Commit e6ced831ef ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")

Fixes: 7212303268 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02 14:07:49 -07:00
Florian Westphal
a6b118febb mptcp: add receive buffer auto-tuning
When mptcp is used, userspace doesn't read from the tcp (subflow)
socket but from the parent (mptcp) socket receive queue.

skbs are moved from the subflow socket to the mptcp rx queue either from
'data_ready' callback (if mptcp socket can be locked), a work queue, or
the socket receive function.

This means tcp_rcv_space_adjust() is never called and thus no receive
buffer size auto-tuning is done.

An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
function that moves skbs from subflow to mptcp socket.
While this enabled autotuning, it also meant tuning was done even if
userspace was reading the mptcp socket very slowly.

This adds mptcp_rcv_space_adjust() and calls it after userspace has
read data from the mptcp socket rx queue.

Its very similar to tcp_rcv_space_adjust, with two differences:

1. The rtt estimate is the largest one observed on a subflow
2. The rcvbuf size and window clamp of all subflows is adjusted
   to the mptcp-level rcvbuf.

Otherwise, we get spurious drops at tcp (subflow) socket level if
the skbs are not moved to the mptcp socket fast enough.

Before:
time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
[..]
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration 40823ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration 23119ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5421ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration 41446ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration 23427ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5426ms) [ OK ]
Time: 1396 seconds

After:
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration  5417ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration  5427ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5422ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration  5415ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration  5422ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5423ms) [ OK ]
Time: 296 seconds

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:47:55 -07:00
Eric Dumazet
ba3bb0e76c tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
Whenever tcp_try_rmem_schedule() returns an error, we are under
trouble and should make sure to wakeup readers so that they
can drain socket queues and eventually make room.

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:46:04 -07:00
Danny Lin
b97e9d9d67 net: sched: Allow changing default qdisc to FQ-PIE
Similar to fq_codel and the other qdiscs that can set as default,
fq_pie is also suitable for general use without explicit configuration,
which makes it a valid choice for this.

This is useful in situations where a painless out-of-the-box solution
for reducing bufferbloat is desired but fq_codel is not necessarily the
best choice. For example, fq_pie can be better for DASH streaming, but
there could be more cases where it's the better choice of the two simple
AQMs available in the kernel.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:43:27 -07:00
Willem de Bruijn
0da7536fb4 ip: Fix SO_MARK in RST, ACK and ICMP packets
When no full socket is available, skbs are sent over a per-netns
control socket. Its sk_mark is temporarily adjusted to match that
of the real (request or timewait) socket or to reflect an incoming
skb, so that the outgoing skb inherits this in __ip_make_skb.

Introduction of the socket cookie mark field broke this. Now the
skb is set through the cookie and cork:

<caller>		# init sockc.mark from sk_mark or cmsg
ip_append_data
  ip_setup_cork		# convert sockc.mark to cork mark
ip_push_pending_frames
  ip_finish_skb
    __ip_make_skb	# set skb->mark to cork mark

But I missed these special control sockets. Update all callers of
__ip(6)_make_skb that were originally missed.

For IPv6, the same two icmp(v6) paths are affected. The third
case is not, as commit 92e55f412c ("tcp: don't annotate
mark on control socket from tcp_v6_send_response()") replaced
the ctl_sk->sk_mark with passing the mark field directly as a
function argument. That commit predates the commit that
introduced the bug.

Fixes: c6af0c227a ("ip: support SO_MARK cmsg")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:38:30 -07:00
Eric Dumazet
e114e1e8ac tcp: md5: do not send silly options in SYNCOOKIES
Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in MD5+TS options,
but again some stacks could break seeing not conventional alignment.

Fixes: 4957faade1 ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:36:23 -07:00
Rao Shoaib
9ef845f894 rds: If one path needs re-connection, check all and re-connect
In testing with mprds enabled, Oracle Cluster nodes after reboot were
not able to communicate with others nodes and so failed to rejoin
the cluster. Peers with lower IP address initiated connection but the
node could not respond as it choose a different path and could not
initiate a connection as it had a higher IP address.

With this patch, when a node sends out a packet and the selected path
is down, all other paths are also checked and any down paths are
re-connected.

Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:35:17 -07:00
Eric Dumazet
e6ced831ef tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
My prior fix went a bit too far, according to Herbert and Mathieu.

Since we accept that concurrent TCP MD5 lookups might see inconsistent
keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()

Clearing all key->key[] is needed to avoid possible KMSAN reports,
if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.

data_race() was added in linux-5.8 and will prevent KCSAN reports,
this can safely be removed in stable backports, if data_race() is
not yet backported.

v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()

Fixes: 6a2febec33 ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Marco Elver <elver@google.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:29:45 -07:00
Sean Tranchetti
1e82a62fec genetlink: remove genl_bind
A potential deadlock can occur during registering or unregistering a
new generic netlink family between the main nl_table_lock and the
cb_lock where each thread wants the lock held by the other, as
demonstrated below.

1) Thread 1 is performing a netlink_bind() operation on a socket. As part
   of this call, it will call netlink_lock_table(), incrementing the
   nl_table_users count to 1.
2) Thread 2 is registering (or unregistering) a genl_family via the
   genl_(un)register_family() API. The cb_lock semaphore will be taken for
   writing.
3) Thread 1 will call genl_bind() as part of the bind operation to handle
   subscribing to GENL multicast groups at the request of the user. It will
   attempt to take the cb_lock semaphore for reading, but it will fail and
   be scheduled away, waiting for Thread 2 to finish the write.
4) Thread 2 will call netlink_table_grab() during the (un)registration
   call. However, as Thread 1 has incremented nl_table_users, it will not
   be able to proceed, and both threads will be stuck waiting for the
   other.

genl_bind() is a noop, unless a genl_family implements the mcast_bind()
function to handle setting up family-specific multicast operations. Since
no one in-tree uses this functionality as Cong pointed out, simply removing
the genl_bind() function will remove the possibility for deadlock, as there
is no attempt by Thread 1 above to take the cb_lock semaphore.

Fixes: c380d9a7af ("genetlink: pass multicast bind/unbind to families")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Johannes Berg <johannes.berg@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 15:49:11 -07:00
Colin Ian King
2a6d6c31f1 net/packet: remove redundant initialization of variable err
The variable err is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 12:44:52 -07:00
Randy Dunlap
6b207d66aa bpf: Fix net/core/filter build errors when INET is not enabled
Fix build errors when CONFIG_INET is not set/enabled.

(.text+0x2b1b): undefined reference to `tcp_prot'
(.text+0x2b3b): undefined reference to `tcp_prot'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/b1a858ec-7e04-56bc-248a-62cb9bbee726@infradead.org
2020-07-01 08:36:58 -07:00
Julian Anastasov
f9200a52ee ipvs: avoid expiring many connections from timer
Add new functions ip_vs_conn_del() and ip_vs_conn_del_put()
to release many IPVS connections in process context.
They are suitable for connections found in table
when we do not want to overload the timers.

Currently, the change is useful for the dropentry delayed
work but it will be used also in following patch
when flushing connections to failed destinations.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-01 10:18:20 +02:00
Dan Carpenter
8ff41cc217 net: qrtr: Fix an out of bounds read qrtr_endpoint_post()
This code assumes that the user passed in enough data for a
qrtr_hdr_v1 or qrtr_hdr_v2 struct, but it's not necessarily true.  If
the buffer is too small then it will read beyond the end.

Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com
Fixes: 194ccc8829 ("net: qrtr: Support decoding incoming v2 packets")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 18:36:13 -07:00
Eric Dumazet
6a2febec33 tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.

Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.

We only want to make sure that in the case key->keylen
is changed, cpus in tcp_md5_hash_key() wont try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt()

Fixes: 9ea88a1530 ("tcp: md5: check md5 signature without socket lock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 18:14:38 -07:00
Carl Huang
28541f3d32 net: qrtr: free flow in __qrtr_node_release
The flow is allocated in qrtr_tx_wait, but not freed when qrtr node
is released. (*slot) becomes NULL after radix_tree_iter_delete is
called in __qrtr_node_release. The fix is to save (*slot) to a
vairable and then free it.

This memory leak is catched when kmemleak is enabled in kernel,
the report looks like below:

unreferenced object 0xffffa0de69e08420 (size 32):
  comm "kworker/u16:3", pid 176, jiffies 4294918275 (age 82858.876s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 28 84 e0 69 de a0 ff ff  ........(..i....
    28 84 e0 69 de a0 ff ff 03 00 00 00 00 00 00 00  (..i............
  backtrace:
    [<00000000e252af0a>] qrtr_node_enqueue+0x38e/0x400 [qrtr]
    [<000000009cea437f>] qrtr_sendmsg+0x1e0/0x2a0 [qrtr]
    [<000000008bddbba4>] sock_sendmsg+0x5b/0x60
    [<0000000003beb43a>] qmi_send_message.isra.3+0xbe/0x110 [qmi_helpers]
    [<000000009c9ae7de>] qmi_send_request+0x1c/0x20 [qmi_helpers]

Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 16:25:04 -07:00
Bartosz Golaszewski
fe189519e4 net: devres: rename the release callback of devm_register_netdev()
Make it an explicit counterpart to devm_register_netdev() just like we
do with devm_free_netdev() for better clarity.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 15:57:34 -07:00
David S. Miller
e708e2bd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-06-30

The following pull-request contains BPF updates for your *net* tree.

We've added 28 non-merge commits during the last 9 day(s) which contain
a total of 35 files changed, 486 insertions(+), 232 deletions(-).

The main changes are:

1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer
   types, from Yonghong Song.

2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various
   arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki.

3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb
   internals and integrate it properly into DMA core, from Christoph Hellwig.

4) Fix RCU splat from recent changes to avoid skipping ingress policy when
   kTLS is enabled, from John Fastabend.

5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its
   position masking to work, from Andrii Nakryiko.

6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading
   of network programs, from Maciej Żenczykowski.

7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer.

8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 14:20:45 -07:00
Yousuk Seung
ff91e9292f tcp: call tcp_ack_tstamp() when not fully acked
When skb is coalesced tcp_ack_tstamp() still needs to be called when not
fully acked in tcp_clean_rtx_queue(), otherwise SCM_TSTAMP_ACK
timestamps may never be fired. Since the original patch series had
dependent commits, this patch fixes the issue instead of reverting by
restoring calls to tcp_ack_tstamp() when skb is not fully acked.

Fixes: fdb7eb21dd ("tcp: stamp SCM_TSTAMP_ACK later in tcp_clean_rtx_queue()")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 13:40:33 -07:00
Paolo Abeni
6bad912b7e mptcp: do nonce initialization at subflow creation time
This clean-up the code a bit, reduces the number of
used hooks and indirect call requested, and allow
better error reporting from __mptcp_subflow_connect()

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 13:38:00 -07:00
Yonghong Song
d923021c2c bpf: Add tests for PTR_TO_BTF_ID vs. null comparison
Add two tests for PTR_TO_BTF_ID vs. null ptr comparison,
one for PTR_TO_BTF_ID in the ctx structure and the
other for PTR_TO_BTF_ID after one level pointer chasing.
In both cases, the test ensures condition is not
removed.

For example, for this test
 struct bpf_fentry_test_t {
     struct bpf_fentry_test_t *a;
 };
 int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
 {
     if (arg == 0)
         test7_result = 1;
     return 0;
 }
Before the previous verifier change, we have xlated codes:
  int test7(long long unsigned int * ctx):
  ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
     0: (79) r1 = *(u64 *)(r1 +0)
  ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
     1: (b4) w0 = 0
     2: (95) exit
After the previous verifier change, we have:
  int test7(long long unsigned int * ctx):
  ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
     0: (79) r1 = *(u64 *)(r1 +0)
  ; if (arg == 0)
     1: (55) if r1 != 0x0 goto pc+4
  ; test7_result = 1;
     2: (18) r1 = map[id:6][0]+48
     4: (b7) r2 = 1
     5: (7b) *(u64 *)(r1 +0) = r2
  ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
     6: (b4) w0 = 0
     7: (95) exit

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com
2020-06-30 22:21:29 +02:00
David S. Miller
d9b8b9845f This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - update mailing list URL, by Sven Eckelmann
 
  - fix typos and grammar in documentation, by Sven Eckelmann
 
  - introduce a configurable per interface hop penalty,
    by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl769w8WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoatoEACN9O+fCwDIlIdxw/bUlGFw1J+V
 4rDReo5grIkl3BSdnLkkaPXTcazCWRoFhCdH6esTsF4/R/nwUa5AZuAbVd2NnCFU
 H+hVc7q/w0Hqppx5APT0AghgLcp8mFuHV9jqtJOjv9Kvd05Il6XwGkALT8QX2e2P
 ypMD+PBuQT1XCMl0v02lA0ki2TxvCtFYs1UznbAdl86GOGMua3t6fuRWQ2fpUMou
 t9251NnwSvSyeYuI8Zws7Wza2S4FOilX/CAjjGi02D0xf7T/JDs3FiPd6obdNfe/
 4j6R2dHWjxz1sb6bEdM3wfmUlU4StMQ+Vywpu1fh3gihLc2tmtMZhn6tJk3YjALt
 zlwDO2p/si7FLrizfMS7mUesZJH7Ji79BZtZWXsdW3p1zU1jlfJh5dc2ZaKwL9qk
 MSpd9DH2F459YC6ELcITLnhCymc7O/g3n3voNnSxi7uX4M5LsGt0mcx6g/iiFN16
 nPNHzXo7WJjgRcmLEQ3kYeD7Kjqh5XEogKLnwhThBxCFIQkNFF3OYqWtJyj5Sg5o
 +EredOAcq1Rc5CR//V43rQZLk4o1I9qQysKFqbjHohMIgLccz+47QAeBXrhku0ZT
 XU6k09Q0pkuo3iVPYnIUTDUkHtvFtsCKR98+8tB5xDR5F155OpFg7AXaKIx5Iym8
 4/eGrx22T+6TIeSGGg==
 =ceu6
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200630' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - update mailing list URL, by Sven Eckelmann

 - fix typos and grammar in documentation, by Sven Eckelmann

 - introduce a configurable per interface hop penalty,
   by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:59:15 -07:00
Florian Fainelli
65951a9eb6 net: dsa: Improve subordinate PHY error message
It is not very informative to know the DSA master device when a
subordinate network device fails to get its PHY setup. Provide the
device name and capitalize PHY while we are it.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:44:20 -07:00
Jason A. Donenfeld
8f9a1fa430 net: xfrmi: implement header_ops->parse_protocol for AF_PACKET
The xfrm interface uses skb->protocol to determine packet type, and
bails out if it's not set. For AF_PACKET injection, we need to support
its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
      dev_parse_header_protocol -> parse_protocol

Without a valid parse_protocol, this returns zero, and xfrmi rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.

Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:29:39 -07:00
Jason A. Donenfeld
75ea1f4773 net: sit: implement header_ops->parse_protocol for AF_PACKET
Sit uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
      dev_parse_header_protocol -> parse_protocol

Without a valid parse_protocol, this returns zero, and sit rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.

Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:29:39 -07:00
Jason A. Donenfeld
ab59d2b698 net: vti: implement header_ops->parse_protocol for AF_PACKET
Vti uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
      dev_parse_header_protocol -> parse_protocol

Without a valid parse_protocol, this returns zero, and vti rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:29:39 -07:00
Jason A. Donenfeld
e53ac93220 net: ipip: implement header_ops->parse_protocol for AF_PACKET
Ipip uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
      dev_parse_header_protocol -> parse_protocol

Without a valid parse_protocol, this returns zero, and ipip rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:29:39 -07:00
Jason A. Donenfeld
2606aff916 net: ip_tunnel: add header_ops for layer 3 devices
Some devices that take straight up layer 3 packets benefit from having a
shared header_ops so that AF_PACKET sockets can inject packets that are
recognized. This shared infrastructure will be used by other drivers
that currently can't inject packets using AF_PACKET. It also exposes the
parser function, as it is useful in standalone form too.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 12:29:39 -07:00
Lorenz Bauer
bb0de3131f bpf: sockmap: Require attach_bpf_fd when detaching a program
The sockmap code currently ignores the value of attach_bpf_fd when
detaching a program. This is contrary to the usual behaviour of
checking that attach_bpf_fd represents the currently attached
program.

Ensure that attach_bpf_fd is indeed the currently attached
program. It turns out that all sockmap selftests already do this,
which indicates that this is unlikely to cause breakage.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30 10:46:39 -07:00
Lorenz Bauer
9b2b09717e bpf: sockmap: Check value of unused args to BPF_PROG_ATTACH
Using BPF_PROG_ATTACH on a sockmap program currently understands no
flags or replace_bpf_fd, but accepts any value. Return EINVAL instead.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
2020-06-30 10:46:39 -07:00
Jakub Sitnicki
695c12147a bpf, netns: Keep attached programs in bpf_prog_array
Prepare for having multi-prog attachments for new netns attach types by
storing programs to run in a bpf_prog_array, which is well suited for
iterating over programs and running them in sequence.

After this change bpf(PROG_QUERY) may block to allocate memory in
bpf_prog_array_copy_to_user() for collected program IDs. This forces a
change in how we protect access to the attached program in the query
callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from
an RCU read lock to holding a mutex that serializes updaters.

Because we allow only one BPF flow_dissector program to be attached to
netns at all times, the bpf_prog_array pointed by net->bpf.run_array is
always either detached (null) or one element long.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30 10:45:08 -07:00
Jakub Sitnicki
3b7016996c flow_dissector: Pull BPF program assignment up to bpf-netns
Prepare for using bpf_prog_array to store attached programs by moving out
code that updates the attached program out of flow dissector.

Managing bpf_prog_array is more involved than updating a single bpf_prog
pointer. This will let us do it all from one place, bpf/net_namespace.c, in
the subsequent patch.

No functional change intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30 10:45:07 -07:00
Eric Dumazet
c4e8fa9074 netfilter: ipset: call ip_set_free() instead of kfree()
Whenever ip_set_alloc() is used, allocated memory can either
use kmalloc() or vmalloc(). We should call kvfree() or
ip_set_free()

invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 21935 Comm: syz-executor.3 Not tainted 5.8.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__phys_addr+0xa7/0x110 arch/x86/mm/physaddr.c:28
Code: 1d 7a 09 4c 89 e3 31 ff 48 d3 eb 48 89 de e8 d0 58 3f 00 48 85 db 75 0d e8 26 5c 3f 00 4c 89 e0 5b 5d 41 5c c3 e8 19 5c 3f 00 <0f> 0b e8 12 5c 3f 00 48 c7 c0 10 10 a8 89 48 ba 00 00 00 00 00 fc
RSP: 0000:ffffc900018572c0 EFLAGS: 00010046
RAX: 0000000000040000 RBX: 0000000000000001 RCX: ffffc9000fac3000
RDX: 0000000000040000 RSI: ffffffff8133f437 RDI: 0000000000000007
RBP: ffffc90098aff000 R08: 0000000000000000 R09: ffff8880ae636cdb
R10: 0000000000000000 R11: 0000000000000000 R12: 0000408018aff000
R13: 0000000000080000 R14: 000000000000001d R15: ffffc900018573d8
FS:  00007fc540c66700(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc9dcd67200 CR3: 0000000059411000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 virt_to_head_page include/linux/mm.h:841 [inline]
 virt_to_cache mm/slab.h:474 [inline]
 kfree+0x77/0x2c0 mm/slab.c:3749
 hash_net_create+0xbb2/0xd70 net/netfilter/ipset/ip_set_hash_gen.h:1536
 ip_set_create+0x6a2/0x13c0 net/netfilter/ipset/ip_set_core.c:1128
 nfnetlink_rcv_msg+0xbe8/0xea0 net/netfilter/nfnetlink.c:230
 netlink_rcv_skb+0x15a/0x430 net/netlink/af_netlink.c:2469
 nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:564
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2352
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2406
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45cb19
Code: Bad RIP value.
RSP: 002b:00007fc540c65c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004fed80 RCX: 000000000045cb19
RDX: 0000000000000000 RSI: 0000000020001080 RDI: 0000000000000003
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000095e R14: 00000000004cc295 R15: 00007fc540c666d4

Fixes: f66ee0410b ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports")
Fixes: 03c8b234e6 ("netfilter: ipset: Generalize extensions support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30 19:09:56 +02:00
Julian Anastasov
857ca89711 ipvs: register hooks only with services
Keep the IPVS hooks registered in Netfilter only
while there are configured virtual services. This
saves CPU cycles while IPVS is loaded but not used.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30 18:37:39 +02:00
Stefano Brivio
d61d2e902a netfilter: nft_set_pipapo: Drop useless assignment of scratch map index on insert
In nft_pipapo_insert(), we need to reallocate scratch maps that will
be used for matching by lookup functions, if they have never been
allocated or if the bucket size changes as a result of the insertion.

As pipapo_realloc_scratch() provides a pair of fresh, zeroed out
maps, there's no need to select a particular one after reallocation.

Other than being useless, the existing assignment was also troubled
by the fact that the index was set only on the CPU performing the
actual insertion, as spotted by Florian.

Simply drop the assignment.

Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30 18:25:07 +02:00
Laura Garcia Liebana
f53b9b0bdc netfilter: introduce support for reject at prerouting stage
REJECT statement can be only used in INPUT, FORWARD and OUTPUT
chains. This patch adds support of REJECT, both icmp and tcp
reset, at PREROUTING stage.

The need for this patch comes from the requirement of some
forwarding devices to reject traffic before the natting and
routing decisions.

The main use case is to be able to send a graceful termination
to legitimate clients that, under any circumstances, the NATed
endpoints are not available. This option allows clients to
decide either to perform a reconnection or manage the error in
their side, instead of just dropping the connection and let
them die due to timeout.

It is supported ipv4, ipv6 and inet families for nft
infrastructure.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30 18:21:02 +02:00
Christoph Hellwig
7e0245753f xsk: Use dma_need_sync instead of reimplenting it
Use the dma_need_sync helper instead of (not always entirely correctly)
poking into the dma-mapping internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
2020-06-30 15:44:03 +02:00
Christoph Hellwig
53937ff7bc xsk: Remove a double pool->dev assignment in xp_dma_map
->dev is already assigned at the top of the function, remove the duplicate
one at the end.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
2020-06-30 15:44:03 +02:00
Christoph Hellwig
91d5b70273 xsk: Replace the cheap_dma flag with a dma_need_sync flag
Invert the polarity and better name the flag so that the use case is
properly documented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
2020-06-30 15:44:03 +02:00
Amit Cohen
ecc31c6024 ethtool: Add link extended state
Currently, drivers can only tell whether the link is up/down using
LINKSTATE_GET, but no additional information is given.

Add attributes to LINKSTATE_GET command in order to allow drivers
to expose the user more information in addition to link state to ease
the debug process, for example, reason for link down state.

Extended state consists of two attributes - link_ext_state and
link_ext_substate. The idea is to avoid 'vendor specific' states in order
to prevent drivers to use specific link_ext_state that can be in the future
common link_ext_state.

The substates allows drivers to add more information to the common
link_ext_state. For example, vendor can expose 'Autoneg' as link_ext_state
and add 'No partner detected during force mode' as link_ext_substate.

If a driver cannot pinpoint the extended state with the substate
accuracy, it is free to expose only the extended state and omit the
substate attribute.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:45:02 -07:00
Po Liu
5f035af76e net:qos: police action offloading parameter 'burst' change to the original value
Since 'tcfp_burst' with TICK factor, driver side always need to recover
it to the original value, this patch moves the generic calculation and
recover to the 'burst' original value before offloading to device driver.

Signed-off-by: Po Liu <po.liu@nxp.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:33:42 -07:00
Paolo Abeni
8a05661b2b mptcp: close poll() races
mptcp_poll always return POLLOUT for unblocking
connect(), ensure that the socket is a suitable
state.
The MPTCP_DATA_READY bit is never cleared on accept:
ensure we don't leave mptcp_accept() with an empty
accept queue and such bit set.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni
76660afbb7 mptcp: __mptcp_tcp_fallback() returns a struct sock
Currently __mptcp_tcp_fallback() always return NULL
on incoming connections, because MPTCP does not create
the additional socket for the first subflow.
Since the previous commit no __mptcp_tcp_fallback()
caller needs a struct socket, so let __mptcp_tcp_fallback()
return the first subflow sock and cope correctly even with
incoming connections.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni
fa68018dc4 mptcp: create first subflow at msk creation time
This cleans the code a bit and makes the behavior more consistent.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni
d2f77c5334 mptcp: check for plain TCP sock at accept time
This cleanup the code a bit and avoid corrupted states
on weird syscall sequence (accept(), connect()).

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Davide Caratti
8fd738049a mptcp: fallback in case of simultaneous connect
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Davide Caratti
e1ff9e82e2 net: mptcp: improve fallback to TCP
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
David S. Miller
1078029172 mlx5-tls-2020-06-26
1) Improve hardware layouts and structure for kTLS support
 
 2) Generalize ICOSQ (Internal Channel Operations Send Queue)
 Due to the asynchronous nature of adding new kTLS flows and handling
 HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
 support generic async operations, such as kTLS add flow and resync, in
 addition to the existing XSK usages.
 
 3) kTLS hardware flow steering and classification:
 The driver already has the means to classify TCP ipv4/6 flows to send them
 to the corresponding RSS HW engine, as reflected in patches 3 through 5,
 the series will add a steering layer that will hook to the driver's TCP
 classifiers and will match on well known kTLS connection, in case of a
 match traffic will be redirected to the kTLS decryption engine, otherwise
 traffic will continue flowing normally to the TCP RSS engine.
 
 3) kTLS add flow RX HW offload support
 New offload contexts post their static/progress params WQEs
 (Work Queue Element) to communicate the newly added kTLS contexts
 over the per-channel async ICOSQ.
 
 The Channel/RQ is selected according to the socket's rxq index.
 
 A new TLS-RX workqueue is used to allow asynchronous addition of
 steering rules, out of the NAPI context.
 It will be also used in a downstream patch in the resync procedure.
 
 Feature is OFF by default. Can be turned on by:
 $ ethtool -K <if> tls-hw-rx-offload on
 
 4) Added mlx5 kTLS sw stats and new counters are documented in
 Documentation/networking/tls-offload.rst
 rx_tls_ctx - number of TLS RX HW offload contexts added to device for
 decryption.
 
 rx_tls_ooo - number of RX packets which were part of a TLS stream
 but did not arrive in the expected order and triggered the resync
 procedure.
 
 rx_tls_del - number of TLS RX HW offload contexts deleted from device
 (connection has finished).
 
 rx_tls_err - number of RX packets which were part of a TLS stream
  but were not decrypted due to unexpected error in the state machine.
 
 5) Asynchronous RX resync
 
 a. The NIC driver indicates that it would like to resync on some TLS
 record within the received packet (P), but the driver does not
 know (yet) which of the TLS records within the packet.
 At this stage, the NIC driver will query the device to find the exact
 TCP sequence for resync (tcpsn), however, the driver does not wait
 for the device to provide the response.
 
 b. Eventually, the device responds, and the driver provides the tcpsn
 within the resync packet to KTLS. Now, KTLS can check the tcpsn against
 any processed TLS records within packet P, and also against any record
 that is processed in the future within packet P.
 
 The asynchronous resync path simplifies the device driver, as it can
 save bits on the packet completion (32-bit TCP sequence), and pass this
 information on an asynchronous command instead.
 
 Performance:
     CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
     NIC: ConnectX-6 Dx 100GbE dual port
 
     Goodput (app-layer throughput) comparison:
     +---------------+-------+-------+---------+
     | # connections |   1   |   4   |    8    |
     +---------------+-------+-------+---------+
     | SW (Gbps)     |  7.26 | 24.70 |   50.30 |
     +---------------+-------+-------+---------+
     | HW (Gbps)     | 18.50 | 64.30 |   92.90 |
     +---------------+-------+-------+---------+
     | Speedup       | 2.55x | 2.56x | 1.85x * |
     +---------------+-------+-------+---------+
 
     * After linerate is reached, diff is observed in CPU util
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl73s2kACgkQSD+KveBX
 +j4wqAf/ZhcEn7i4N2F9wMMIL6wd4DgwKWWhbGpiREIxDwcRbqH7PGom8nBZMNd9
 +3g3zfURvByWehLtYcjmMgR4B7+xDgEs0dSx6pQM9764HqLDV2jW8ENr9Vr/u8s1
 hJ/eV8uzIfvx27MzbENZi0oJTw7N9nCgdcv1OyZkIba+Iado9pOeakPgBmTbINgo
 46LJI9nIEROE15gfjyxrVeYAs3Nxt+bogQCWYfMqUfRmKcMJ0d4oTHaUdtmm+xQB
 jC685/e4gE7jRgZ3qH/xvCZYp7+TVKaXsB0EtaJdPFEkvvvQpgPTfquIQ+6l7vvE
 Yf1YUhnDOoxGUQy1CdSZ2reNxLIm8A==
 =7+rG
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-tls-2020-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-tls-2020-06-26

1) Improve hardware layouts and structure for kTLS support

2) Generalize ICOSQ (Internal Channel Operations Send Queue)
Due to the asynchronous nature of adding new kTLS flows and handling
HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
support generic async operations, such as kTLS add flow and resync, in
addition to the existing XSK usages.

3) kTLS hardware flow steering and classification:
The driver already has the means to classify TCP ipv4/6 flows to send them
to the corresponding RSS HW engine, as reflected in patches 3 through 5,
the series will add a steering layer that will hook to the driver's TCP
classifiers and will match on well known kTLS connection, in case of a
match traffic will be redirected to the kTLS decryption engine, otherwise
traffic will continue flowing normally to the TCP RSS engine.

3) kTLS add flow RX HW offload support
New offload contexts post their static/progress params WQEs
(Work Queue Element) to communicate the newly added kTLS contexts
over the per-channel async ICOSQ.

The Channel/RQ is selected according to the socket's rxq index.

A new TLS-RX workqueue is used to allow asynchronous addition of
steering rules, out of the NAPI context.
It will be also used in a downstream patch in the resync procedure.

Feature is OFF by default. Can be turned on by:
$ ethtool -K <if> tls-hw-rx-offload on

4) Added mlx5 kTLS sw stats and new counters are documented in
Documentation/networking/tls-offload.rst
rx_tls_ctx - number of TLS RX HW offload contexts added to device for
decryption.

rx_tls_ooo - number of RX packets which were part of a TLS stream
but did not arrive in the expected order and triggered the resync
procedure.

rx_tls_del - number of TLS RX HW offload contexts deleted from device
(connection has finished).

rx_tls_err - number of RX packets which were part of a TLS stream
 but were not decrypted due to unexpected error in the state machine.

5) Asynchronous RX resync

a. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.

b. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.

The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.

Performance:
    CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
    NIC: ConnectX-6 Dx 100GbE dual port

    Goodput (app-layer throughput) comparison:
    +---------------+-------+-------+---------+
    | # connections |   1   |   4   |    8    |
    +---------------+-------+-------+---------+
    | SW (Gbps)     |  7.26 | 24.70 |   50.30 |
    +---------------+-------+-------+---------+
    | HW (Gbps)     | 18.50 | 64.30 |   92.90 |
    +---------------+-------+-------+---------+
    | Speedup       | 2.55x | 2.56x | 1.85x * |
    +---------------+-------+-------+---------+

    * After linerate is reached, diff is observed in CPU util
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:18:40 -07:00
Cong Wang
bf64ff4c2a genetlink: get rid of family->attrbuf
genl_family_rcv_msg_attrs_parse() reuses the global family->attrbuf
when family->parallel_ops is false. However, family->attrbuf is not
protected by any lock on the genl_family_rcv_msg_doit() code path.

This leads to several different consequences, one of them is UAF,
like the following:

genl_family_rcv_msg_doit():		genl_start():
					  genl_family_rcv_msg_attrs_parse()
					    attrbuf = family->attrbuf
					    __nlmsg_parse(attrbuf);
  genl_family_rcv_msg_attrs_parse()
    attrbuf = family->attrbuf
    __nlmsg_parse(attrbuf);
					  info->attrs = attrs;
					  cb->data = info;

netlink_unicast_kernel():
 consume_skb()
					genl_lock_dumpit():
					  genl_dumpit_info(cb)->attrs

Note family->attrbuf is an array of pointers to the skb data, once
the skb is freed, any dereference of family->attrbuf will be a UAF.

Maybe we could serialize the family->attrbuf with genl_mutex too, but
that would make the locking more complicated. Instead, we can just get
rid of family->attrbuf and always allocate attrbuf from heap like the
family->parallel_ops==true code path. This may add some performance
overhead but comparing with taking the global genl_mutex, it still
looks better.

Fixes: 75cdbdd089 ("net: ieee802154: have genetlink code to parse the attrs during dumpit")
Fixes: 057af70713 ("net: tipc: have genetlink code to parse the attrs during dumpit")
Reported-and-tested-by: syzbot+3039ddf6d7b13daf3787@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+80cad1e3cb4c41cde6ff@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+736bcbcb11b60d0c0792@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+520f8704db2b68091d44@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c96e4dfb32f8987fdeed@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:15:57 -07:00
Petr Machata
aee9caa03f net: sched: sch_red: Add qevents "early_drop" and "mark"
In order to allow acting on dropped and/or ECN-marked packets, add two new
qevents to the RED qdisc: "early_drop" and "mark". Filters attached at
"early_drop" block are executed as packets are early-dropped, those
attached at the "mark" block are executed as packets are ECN-marked.

Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block
index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark"
qevent. Absence of these attributes signifies "don't care": no block is
allocated in that case, or the existing blocks are left intact in case of
the change callback.

For purposes of offloading, blocks attached to these qevents appear with
newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and
FLOW_BLOCK_BINDER_TYPE_RED_MARK.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:08:28 -07:00
Petr Machata
65545ea249 net: sched: sch_red: Split init and change callbacks
In the following patches, RED will get two qevents. The implementation will
be clearer if the callback for change is not a pure subset of the callback
for init. Split the two and promote attribute parsing to the callbacks
themselves from the common code, because it will be handy there.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:08:28 -07:00
Petr Machata
3625750f05 net: sched: Introduce helpers for qevent blocks
Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.

Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc,
blocks attachment cannot be changed after the qdisc is created. To that
end, add a helper tcf_qevent_validate_change(), which verifies whether
block index attribute is not attached, or if it is, whether its value
matches the current one (i.e. there is no material change).

The function tcf_qevent_handle() should be invoked when qdisc hits the
"interesting event" corresponding to a block. This function releases root
lock for the duration of executing the attached filters, to allow packets
generated through user actions (notably mirred) to be reinserted to the
same qdisc tree.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:08:28 -07:00
Petr Machata
aebe4426cc net: sched: Pass root lock to Qdisc_ops.enqueue
A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.

To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:08:28 -07:00
David S. Miller
33c568ba49 Couple of fixes/small things:
* TX control port status check fixed to not assume frame format
  * mesh control port fixes
  * error handling/leak fixes when starting AP, with HE attributes
  * fix broadcast packet handling with encapsulation offload
  * add new AKM suites
  * and a small code cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl75+b4ACgkQB8qZga/f
 l8QfpA//QLGvnLKLZwEpmGhqtp/ZvmX3jv24QsbxLO8do210/UqVojf8WrUUWq8P
 5kccKp1wxbobuUVzviAmSRRfYIkXQjFsz82dV1hei2xVhhOB1rHavCPqkorZIX+s
 p3a9WzaqRwgK+chGtGGDEGjkx31+Ve8tRkvRmIfhxX+r1mAF1jI8gwlsUmX7jTfN
 VfbiLP1pbyNemNBqrugjhOmprTqUanfaY9alJVEaglQAild7YcFdmqbpKEBogQk7
 0KPnYRsD0w3pv4ewLeBypnz4+kD2+JPaPkgAeq1ppi9YcvFwTalTRaP2P4miL0+x
 lxfsT8eSlCls/FTEb7UheWP1Xt4ym0xAQh9e9VaWavzvbw2l6oHtCOEFL9YQrOa2
 xSmNJBvle82DXy38RSFpcsaQSHAS3TAJyEeS1Z1b0D7IzBudpbwDLE828FXwylDP
 eTnlNy7zrYtKwazhtI0uVB2K/DrEWmwqnyPmX75AiP/3FmiDI4mxmIj0hwpzC27e
 jfl2d7qWLRQ8WVs0mE5u7QnNz2sZ3trq6TeO6NwLDIjdm65GpiQB2kpS6DsmIOiu
 phpS8vnGAC9ZNThIDvFPYvEBTkRQ+hzrsonwIU54PEVj3RMAqQaL3RPrN7UFfmVK
 fDZ21rjzujOchDZNCzwIpXMeAz1EzyNOzBcYwDs2bOoSKD5nYdE=
 =rzWh
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Couple of fixes/small things:
 * TX control port status check fixed to not assume frame format
 * mesh control port fixes
 * error handling/leak fixes when starting AP, with HE attributes
 * fix broadcast packet handling with encapsulation offload
 * add new AKM suites
 * and a small code cleanup
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 16:58:30 -07:00
Christophe Leroy
becd201492 SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
Even if that's only a warning, not including asm/cacheflush.h
leads to svc_flush_bvec() being empty allthough powerpc defines
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.

  CC      net/sunrpc/svcsock.o
net/sunrpc/svcsock.c:227:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is not defined [-Wundef]
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
     ^

Include linux/highmem.h so that asm/cacheflush.h will be included.

Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: ca07eda33e ("SUNRPC: Refactor svc_recvfrom()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-06-29 14:50:25 -04:00
Eric Dumazet
a9b1110162 llc: make sure applications use ARPHRD_ETHER
syzbot was to trigger a bug by tricking AF_LLC with
non sensible addr->sllc_arphrd

It seems clear LLC requires an Ethernet device.

Back in commit abf9d537fe ("llc: add support for SO_BINDTODEVICE")
Octavian Purdila added possibility for application to use a zero
value for sllc_arphrd, convert it to ARPHRD_ETHER to not cause
regressions on existing applications.

BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:199 [inline]
BUG: KASAN: use-after-free in list_empty include/linux/list.h:268 [inline]
BUG: KASAN: use-after-free in waitqueue_active include/linux/wait.h:126 [inline]
BUG: KASAN: use-after-free in wq_has_sleeper include/linux/wait.h:160 [inline]
BUG: KASAN: use-after-free in skwq_has_sleeper include/net/sock.h:2092 [inline]
BUG: KASAN: use-after-free in sock_def_write_space+0x642/0x670 net/core/sock.c:2813
Read of size 8 at addr ffff88801e0b4078 by task ksoftirqd/3/27

CPU: 3 PID: 27 Comm: ksoftirqd/3 Not tainted 5.5.0-rc1-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:639
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
 __read_once_size include/linux/compiler.h:199 [inline]
 list_empty include/linux/list.h:268 [inline]
 waitqueue_active include/linux/wait.h:126 [inline]
 wq_has_sleeper include/linux/wait.h:160 [inline]
 skwq_has_sleeper include/net/sock.h:2092 [inline]
 sock_def_write_space+0x642/0x670 net/core/sock.c:2813
 sock_wfree+0x1e1/0x260 net/core/sock.c:1958
 skb_release_head_state+0xeb/0x260 net/core/skbuff.c:652
 skb_release_all+0x16/0x60 net/core/skbuff.c:663
 __kfree_skb net/core/skbuff.c:679 [inline]
 consume_skb net/core/skbuff.c:838 [inline]
 consume_skb+0xfb/0x410 net/core/skbuff.c:832
 __dev_kfree_skb_any+0xa4/0xd0 net/core/dev.c:2967
 dev_kfree_skb_any include/linux/netdevice.h:3650 [inline]
 e1000_unmap_and_free_tx_resource.isra.0+0x21b/0x3a0 drivers/net/ethernet/intel/e1000/e1000_main.c:1963
 e1000_clean_tx_irq drivers/net/ethernet/intel/e1000/e1000_main.c:3854 [inline]
 e1000_clean+0x4cc/0x1d10 drivers/net/ethernet/intel/e1000/e1000_main.c:3796
 napi_poll net/core/dev.c:6532 [inline]
 net_rx_action+0x508/0x1120 net/core/dev.c:6600
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 run_ksoftirqd kernel/softirq.c:603 [inline]
 run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 8247:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:513 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
 slab_post_alloc_hook mm/slab.h:584 [inline]
 slab_alloc mm/slab.c:3320 [inline]
 kmem_cache_alloc+0x121/0x710 mm/slab.c:3484
 sock_alloc_inode+0x1c/0x1d0 net/socket.c:240
 alloc_inode+0x68/0x1e0 fs/inode.c:230
 new_inode_pseudo+0x19/0xf0 fs/inode.c:919
 sock_alloc+0x41/0x270 net/socket.c:560
 __sock_create+0xc2/0x730 net/socket.c:1384
 sock_create net/socket.c:1471 [inline]
 __sys_socket+0x103/0x220 net/socket.c:1513
 __do_sys_socket net/socket.c:1522 [inline]
 __se_sys_socket net/socket.c:1520 [inline]
 __ia32_sys_socket+0x73/0xb0 net/socket.c:1520
 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
 do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

Freed by task 17:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:335 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
 __cache_free mm/slab.c:3426 [inline]
 kmem_cache_free+0x86/0x320 mm/slab.c:3694
 sock_free_inode+0x20/0x30 net/socket.c:261
 i_callback+0x44/0x80 fs/inode.c:219
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch kernel/rcu/tree.c:2183 [inline]
 rcu_core+0x570/0x1540 kernel/rcu/tree.c:2408
 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2417
 __do_softirq+0x262/0x98c kernel/softirq.c:292

The buggy address belongs to the object at ffff88801e0b4000
 which belongs to the cache sock_inode_cache of size 1152
The buggy address is located 120 bytes inside of
 1152-byte region [ffff88801e0b4000, ffff88801e0b4480)
The buggy address belongs to the page:
page:ffffea0000782d00 refcount:1 mapcount:0 mapping:ffff88807aa59c40 index:0xffff88801e0b4ffd
raw: 00fffe0000000200 ffffea00008e6c88 ffffea0000782d48 ffff88807aa59c40
raw: ffff88801e0b4ffd ffff88801e0b4000 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88801e0b3f00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff88801e0b3f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88801e0b4000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                ^
 ffff88801e0b4080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88801e0b4100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: abf9d537fe ("llc: add support for SO_BINDTODEVICE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 21:41:23 -07:00
Cong Wang
e8280338c7 net: explain the lockdep annotations for dev_uc_unsync()
The lockdep annotations for dev_uc_unsync() and dev_mc_unsync()
are not easy to understand, so add some comments to explain
why they are correct.

Similar for the rest netif_addr_lock_bh() cases, they don't
need nested version.

Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 21:38:27 -07:00
Cong Wang
be74294ffa net: get rid of lockdep_set_class_and_subclass()
lockdep_set_class_and_subclass() is meant to reduce
the _nested() annotations by assigning a default subclass.
For addr_list_lock, we have to compute the subclass at
run-time as the netdevice topology changes after creation.

So, we should just get rid of these
lockdep_set_class_and_subclass() and stick with our _nested()
annotations.

Fixes: 845e0ebb44 ("net: change addr_list_lock back to static key")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 21:37:23 -07:00
Luc Van Oostenryck
433f17a93c l2tp: fix l2tp_eth_dev_xmit()'s return type
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.

Fix this by returning 'netdev_tx_t' in this driver too.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 20:52:53 -07:00
Luc Van Oostenryck
42deace2a5 net/hsr: fix hsr_dev_xmit()'s return type
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.

Fix this by returning 'netdev_tx_t' in this driver too.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 20:52:53 -07:00
Horatiu Vultur
9b14d1f8a7 bridge: mrp: Fix endian conversion and some other warnings
The following sparse warnings are fixed:
net/bridge/br_mrp.c:106:18: warning: incorrect type in assignment (different base types)
net/bridge/br_mrp.c:106:18:    expected unsigned short [usertype]
net/bridge/br_mrp.c:106:18:    got restricted __be16 [usertype]
net/bridge/br_mrp.c:281:23: warning: incorrect type in argument 1 (different modifiers)
net/bridge/br_mrp.c:281:23:    expected struct list_head *entry
net/bridge/br_mrp.c:281:23:    got struct list_head [noderef] *
net/bridge/br_mrp.c:332:28: warning: incorrect type in argument 1 (different modifiers)
net/bridge/br_mrp.c:332:28:    expected struct list_head *new
net/bridge/br_mrp.c:332:28:    got struct list_head [noderef] *
net/bridge/br_mrp.c:332:40: warning: incorrect type in argument 2 (different modifiers)
net/bridge/br_mrp.c:332:40:    expected struct list_head *head
net/bridge/br_mrp.c:332:40:    got struct list_head [noderef] *
net/bridge/br_mrp.c:682:29: warning: incorrect type in argument 1 (different modifiers)
net/bridge/br_mrp.c:682:29:    expected struct list_head const *head
net/bridge/br_mrp.c:682:29:    got struct list_head [noderef] *

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 2f1a11ae11 ("bridge: mrp: Add MRP interface.")
Fixes: 4b8d7d4c59 ("bridge: mrp: Extend bridge interface")
Fixes: 9a9f26e8f7 ("bridge: mrp: Connect MRP API with the switchdev API")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28 20:44:10 -07:00
John Fastabend
8025751d4d bpf, sockmap: RCU dereferenced psock may be used outside RCU block
If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.

But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.

There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.

This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.

Fixes: e91de6afa8 ("bpf: Fix running sk_skb program types with ktls")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
2020-06-28 08:33:28 -07:00
John Fastabend
93dd5f1859 bpf, sockmap: RCU splat with redirect and strparser error or TLS
There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.

However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,

 static void sk_psock_strp_data_ready(struct sock *sk)
 {
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		if (tls_sw_has_ctx_rx(sk)) {
			psock->parser.saved_data_ready(sk);
		} else {
			write_lock_bh(&sk->sk_callback_lock);
			strp_data_ready(&psock->parser.strp);
			write_unlock_bh(&sk->sk_callback_lock);
		}
	}
	rcu_read_unlock();
 }

If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)

To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.

[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G        W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191]  dump_stack+0x8f/0xd0
[ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]

v2: Improve commit message to identify non-TLS redirect plus error case
    condition as well as more common TLS case. In the process I decided
    doing the rcu_read_unlock followed by the lock/unlock inside branches
    was unnecessarily complex. We can just extend the current rcu block
    and get the same effeective without the shuffling and branching.
    Thanks Martin!

Fixes: e91de6afa8 ("bpf: Fix running sk_skb program types with ktls")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
2020-06-28 08:33:28 -07:00
Miaohe Lin
2ce578ca94 net: ipv4: Fix wrong type conversion from hint to rt in ip_route_use_hint()
We can't cast sk_buff to rtable by (struct rtable *)hint. Use skb_rtable().

Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 18:02:32 -07:00
Yousuk Seung
082d4fa980 tcp: update delivered_ce with delivered
Currently tp->delivered is updated in various places in tcp_ack() but
tp->delivered_ce is updated once at the end. As a result two counts in
OPT_STATS of SCM_TSTAMP_ACK timestamps generated in tcp_ack() may not be
in sync. This patch updates both counts at the same in tcp_ack().

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Yousuk Seung
f00394ce60 tcp: count sacked packets in tcp_sacktag_state
Add sack_delivered to tcp_sacktag_state and count the number of sacked
and dsacked packets. This is pure refactor for future patches to improve
tracking delivered counts.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Yousuk Seung
c634e34f6e tcp: add ece_ack flag to reno sack functions
Pass a boolean flag that tells the ECE state of the current ack to reno
sack functions. This is pure refactor for future patches to improve
tracking delivered counts.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Yousuk Seung
fdb7eb21dd tcp: stamp SCM_TSTAMP_ACK later in tcp_clean_rtx_queue()
Currently tp->delivered is updated with sacked packets but not
cumulatively acked when SCP_TSTAMP_ACK is timestamped. This patch moves
a tcp_ack_tstamp() call in tcp_clean_rtx_queue() to later in the loop so
that when a skb is fully acked OPT_STATS of SCM_TSTAMP_ACK will include
the current skb in the delivered count. When not fully acked
tcp_ack_tstamp() is a no-op and there is no change in behavior.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-27 17:41:27 -07:00
Boris Pismenny
ed9b7646b0 net/tls: Add asynchronous resync
This patch adds support for asynchronous resynchronization in tls_device.
Async resync follows two distinct stages:

1. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.

2. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.

The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27 14:00:22 -07:00
Boris Pismenny
acb5a07aaf Revert "net/tls: Add force_resync for driver resync"
This reverts commit b3ae2459f8.
Revert the force resync API.
Not in use. To be replaced by a better async resync API downstream.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27 14:00:21 -07:00
Linus Torvalds
4e99b32169 NFS Client Bugfixes for Linux 5.8-rc
Stable Fixes:
 - xprtrdma: Fix handling of RDMA_ERROR replies
 - sunrpc: Fix rollback in rpc_gssd_dummy_populate()
 - pNFS/flexfiles: Fix list corruption if the mirror count changes
 - NFSv4: Fix CLOSE not waiting for direct IO completion
 - SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
 
 Other Fixes:
 - xprtrdma: Fix a use-after-free with r_xprt->rx_ep
 - Fix other xprtrdma races during disconnect
 - NFS: Fix memory leak of export_path
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl72aFkACgkQ18tUv7Cl
 QOsWfg//ewmCjJV1LGJJM2ntxcN9xAZJdIY3cuYfxQaDr/qwdbgh8DNbPlkImaoB
 aW5DVciqKJ8HpJchko4wYvNbbnAI32Kd87RcmUYoXwwUY+H2kwuOf41Vm4jfrScF
 NHiN5b5GTUz2X/s83NsbE9uGCFE1TS8pJkn6chVEWJY+QOjWpQmJrFQ0E9ULwP1O
 g46Dym9RtILrsNyGcSks6Rnts4Ujm3+PDW+hLWjGzwotDgMS2LGZ7oQpfcs0NvHs
 A3RjSOywltockeKvqchibTZMAXjIxqLV8cmo6AsT2H3llGbr+F61DkBMuTgqozhp
 QAONwvxDv6EcnsS5NnOJJdhwG7IK1dPIA5oxmGq7XlhShZF+hrfvGYyhkmDkdf8V
 9wfpV6foPC07hTcd+h0+A5DTh4Bxi71q+VIvVyQzgvX4UgRMrRptkNUzAm/Tn56C
 JoFtjxswy0W476rqYaIJKjs/Mv1eozwvEifIuwpMu+VWiwiNEygNKyvmdVYxeDmv
 13hjXVbQCCjyPvQSmBRKUEOR07DxHUt5Kcy9xHQ5ZXr5KdCERSt9MfXucxUxMQTA
 JG143HPt3P7tkr+1wIyerN94w0kZGQqtQR/BHd5Ms0abrv+jgqjQVleFd4vX2igU
 o/pCH4SLEhEndChU6lvv534ilRSH5LLQifyV2ThFFdZpOhtw7tU=
 =BzX9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable Fixes:
   - xprtrdma: Fix handling of RDMA_ERROR replies
   - sunrpc: Fix rollback in rpc_gssd_dummy_populate()
   - pNFS/flexfiles: Fix list corruption if the mirror count changes
   - NFSv4: Fix CLOSE not waiting for direct IO completion
   - SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()

  Other Fixes:
   - xprtrdma: Fix a use-after-free with r_xprt->rx_ep
   - Fix other xprtrdma races during disconnect
   - NFS: Fix memory leak of export_path"

* tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
  NFSv4 fix CLOSE not waiting for direct IO compeletion
  pNFS/flexfiles: Fix list corruption if the mirror count changes
  nfs: Fix memory leak of export_path
  sunrpc: fixed rollback in rpc_gssd_dummy_populate()
  xprtrdma: Fix handling of RDMA_ERROR replies
  xprtrdma: Clean up disconnect
  xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect()
  xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()
  xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed
2020-06-27 09:35:47 -07:00
Paolo Abeni
a8ee9c9b58 mptcp: introduce token KUNIT self-tests
Unit tests for the internal MPTCP token APIs, using KUNIT

v1 -> v2:
 - use the correct RCU annotation when initializing icsk ulp
 - fix a few checkpatch issues

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni
a00a582203 mptcp: move crypto test to KUNIT
currently MPTCP uses a custom hook to executed unit tests at
boot time. Let's use the KUNIT framework instead.
Additionally move the relevant code to a separate file and
export the function needed by the test when self-tests
are build as a module.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni
2c5ebd001d mptcp: refactor token container
Replace the radix tree with a hash table allocated
at boot time. The radix tree has some shortcoming:
a single lock is contented by all the mptcp operation,
the lookup currently use such lock, and traversing
all the items would require a lock, too.

With hash table instead we trade a little memory to
address all the above - a per bucket lock is used.

To hash the MPTCP sockets, we re-use the msk' sk_node
entry: the MPTCP sockets are never hashed by the stack.
Replace the existing hash proto callbacks with a dummy
implementation, annotating the above constraint.

Additionally refactor the token creation to code to:

- limit the number of consecutive attempts to a fixed
maximum. Hitting a hash bucket with a long chain is
considered a failed attempt

- accept() no longer can fail to token management.

- if token creation fails at connect() time, we do
fallback to TCP (before the connection was closed)

v1 -> v2:
 - fix "no newline at end of file" - Jakub

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni
d39dceca38 mptcp: add __init annotation on setup functions
Add the missing annotation in some setup-only
functions.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Chuck Lever
89a3c9f5b9 SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
@subbuf is an output parameter of xdr_buf_subsegment(). A survey of
call sites shows that @subbuf is always uninitialized before
xdr_buf_segment() is invoked by callers.

There are some execution paths through xdr_buf_subsegment() that do
not set all of the fields in @subbuf, leaving some pointer fields
containing garbage addresses. Subsequent processing of that buffer
then results in a page fault.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26 08:45:23 -04:00
Vasily Averin
b7ade38165 sunrpc: fixed rollback in rpc_gssd_dummy_populate()
__rpc_depopulate(gssd_dentry) was lost on error path

cc: stable@vger.kernel.org
Fixes: commit 4b9a445e3e ("sunrpc: create a new dummy pipe for gssd to hold open")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26 08:43:14 -04:00
Luca Coelho
60a0121f8f nl80211: fix memory leak when parsing NL80211_ATTR_HE_BSS_COLOR
If there is an error when parsing the NL80211_ATTR_HE_BSS_COLOR
attribute, we return immediately without freeing param.acl.  Fit it by
using goto out instead of returning immediately.

Fixes: 5c5e52d1bb ("nl80211: add handling for BSS color")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200626124931.7ad2a3eb894f.I60905fb70bd20389a3b170db515a07275e31845e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-26 11:52:57 +02:00
Luca Coelho
bc7a39b427 nl80211: don't return err unconditionally in nl80211_start_ap()
When a memory leak was fixed, a return err was changed to goto err,
but, accidentally, the if (err) was removed, so now we always exit at
this point.

Fix it by adding if (err) back.

Fixes: 9951ebfcdf ("nl80211: fix potential leak in AP start")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200626124931.871ba5b31eee.I97340172d92164ee92f3c803fe20a8a6e97714e1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-26 11:52:52 +02:00
Linus Lüssing
3bda14d09d batman-adv: Introduce a configurable per interface hop penalty
In some setups multiple hard interfaces with similar link qualities
or throughput values are available. But people have expressed the desire
to consider one of them as a backup only.

Some creative solutions are currently in use: Such people are
configuring multiple batman-adv mesh/soft interfaces, wire them
together with some veth pairs and then tune the hop penalty to achieve
an effect similar to a tunable per interface hop penalty.

This patch introduces a new, configurable, per hard interface hop penalty
to simplify such setups.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-06-26 10:37:11 +02:00
Sven Eckelmann
bccb48c89f batman-adv: Fix typos and grammar in documentation
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-06-26 10:36:30 +02:00
Simon Wunderlich
d528510e6d batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-06-26 10:35:06 +02:00
David S. Miller
7bed145516 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in xfrm_device.c, between the double
ESP trailing bug fix setting the XFRM_INIT flag and the changes
in net-next preparing for bonding encryption support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 19:29:51 -07:00
Linus Torvalds
4a21185cda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen.

 2) The default crypto algorithm selection in Kconfig for IPSEC is out
    of touch with modern reality, fix this up. From Eric Biggers.

 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii
    Nakryiko.

 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from
    Hangbin Liu.

 5) Adjust packet alignment handling in ax88179_178a driver to match
    what the hardware actually does. From Jeremy Kerr.

 6) register_netdevice can leak in the case one of the notifiers fail,
    from Yang Yingliang.

 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo.

 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir
    Oltean.

 9) tg3 driver can sleep forever when we get enough EEH errors, fix from
    David Christensen.

10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet
    drivers, from Ciara Loftus.

11) Fix scanning loop break condition in of_mdiobus_register(), from
    Florian Fainelli.

12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon.

13) Endianness fix in mlxsw, from Ido Schimmel.

14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen.

15) Missing bridge mrp configuration validation, from Horatiu Vultur.

16) Fix circular netns references in wireguard, from Jason A. Donenfeld.

17) PTP initialization on recovery is not done properly in qed driver,
    from Alexander Lobakin.

18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong,
    from Rahul Lakkireddy.

19) Don't clear bound device TX queue of socket prematurely otherwise we
    get problems with ktls hw offloading, from Tariq Toukan.

20) ipset can do atomics on unaligned memory, fix from Russell King.

21) Align ethernet addresses properly in bridging code, from Thomas
    Martitz.

22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set,
    from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits)
  rds: transport module should be auto loaded when transport is set
  sch_cake: fix a few style nits
  sch_cake: don't call diffserv parsing code when it is not needed
  sch_cake: don't try to reallocate or unshare skb unconditionally
  ethtool: fix error handling in linkstate_prepare_data()
  wil6210: account for napi_gro_receive never returning GRO_DROP
  hns: do not cast return value of napi_gro_receive to null
  socionext: account for napi_gro_receive never returning GRO_DROP
  wireguard: receive: account for napi_gro_receive never returning GRO_DROP
  vxlan: fix last fdb index during dump of fdb with nhid
  sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
  tc-testing: avoid action cookies with odd length.
  bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  net: dsa: sja1105: fix tc-gate schedule with single element
  net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
  net: dsa: sja1105: unconditionally free old gating config
  net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top
  net: macb: free resources on failure path of at91ether_open()
  net: macb: call pm_runtime_put_sync on failure path
  ...
2020-06-25 18:27:40 -07:00
Kevin Darbyshire-Bryant
b8392808eb sch_cake: add RFC 8622 LE PHB support to CAKE diffserv handling
Change tin mapping on diffserv3, 4 & 8 for LE PHB support, in essence
making LE a member of the Bulk tin.

Bulk has the least priority and minimum of 1/16th total bandwidth in the
face of higher priority traffic.

NB: Diffserv 3 & 4 swap tin 0 & 1 priorities from the default order as
found in diffserv8, in case anyone is wondering why it looks a bit odd.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
[ reword commit message slightly ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:31:35 -07:00
Rao Shoaib
4c342f778f rds: transport module should be auto loaded when transport is set
This enhancement auto loads transport module when the transport
is set via SO_RDS_TRANSPORT socket option.

Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:26:25 -07:00
Toke Høiland-Jørgensen
3f608f0c41 sch_cake: fix a few style nits
I spotted a few nits when comparing the in-tree version of sch_cake with
the out-of-tree one: A redundant error variable declaration shadowing an
outer declaration, and an indentation alignment issue. Fix both of these.

Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:24:05 -07:00
Toke Høiland-Jørgensen
8c95eca0bb sch_cake: don't call diffserv parsing code when it is not needed
As a further optimisation of the diffserv parsing codepath, we can skip it
entirely if CAKE is configured to neither use diffserv-based
classification, nor to zero out the diffserv bits.

Fixes: c87b4ecdbe ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:24:05 -07:00
Ilya Ponetayev
9208d2863a sch_cake: don't try to reallocate or unshare skb unconditionally
cake_handle_diffserv() tries to linearize mac and network header parts of
skb and to make it writable unconditionally. In some cases it leads to full
skb reallocation, which reduces throughput and increases CPU load. Some
measurements of IPv4 forward + NAPT on MIPS router with 580 MHz single-core
CPU was conducted. It appears that on kernel 4.9 skb_try_make_writable()
reallocates skb, if skb was allocated in ethernet driver via so-called
'build skb' method from page cache (it was discovered by strange increase
of kmalloc-2048 slab at first).

Obtain DSCP value via read-only skb_header_pointer() call, and leave
linearization only for DSCP bleaching or ECN CE setting. And, as an
additional optimisation, skip diffserv parsing entirely if it is not needed
by the current configuration.

Fixes: c87b4ecdbe ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
[ fix a few style issues, reflow commit message ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:24:05 -07:00
Michal Kubecek
1ae71d997a ethtool: fix error handling in linkstate_prepare_data()
When getting SQI or maximum SQI value fails in linkstate_prepare_data(), we
must not return without calling ethnl_ops_complete(dev) as that could
result in imbalance between ethtool_ops ->begin() and ->complete() calls.

Fixes: 8066021915 ("ethtool: provide UAPI for PHY Signal Quality Index (SQI)")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:17:16 -07:00
Marcelo Ricardo Leitner
471e39df96 sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
If a socket is set ipv6only, it will still send IPv4 addresses in the
INIT and INIT_ACK packets. This potentially misleads the peer into using
them, which then would cause association termination.

The fix is to not add IPv4 addresses to ipv6only sockets.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:11:33 -07:00
Neal Cardwell
b344579ca8 tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where
Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an
ACK when the minimum rtt of a connection goes down. From inspection it
is clear from the existing code that this could happen in an example
like the following:

o The first 8 RTT samples in a round trip are 150ms, resulting in a
  curr_rtt of 150ms and a delay_min of 150ms.

o The 9th RTT sample is 100ms. The curr_rtt does not change after the
  first 8 samples, so curr_rtt remains 150ms. But delay_min can be
  lowered at any time, so delay_min falls to 100ms. The code executes
  the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min
  of 100ms, and the curr_rtt is declared far enough above delay_min to
  force a (spurious) exit of Slow start.

The fix here is simple: allow every RTT sample in a round trip to
lower the curr_rtt.

Fixes: ae27e98a51 ("[TCP] CUBIC v2.3")
Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 16:08:47 -07:00
David S. Miller
f4926d513b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net, they are:

1) Unaligned atomic access in ipset, from Russell King.

2) Missing module description, from Rob Gill.

3) Patches to fix a module unload causing NULL pointer dereference in
   xtables, from David Wilder. For the record, I posting here his cover
   letter explaining the problem:

    A crash happened on ppc64le when running ltp network tests triggered by
    "rmmod iptable_mangle".

    See previous discussion in this thread:
    https://lists.openwall.net/netdev/2020/06/03/161 .

    In the crash I found in iptable_mangle_hook() that
    state->net->ipv4.iptable_mangle=NULL causing a NULL pointer dereference.
    net->ipv4.iptable_mangle is set to NULL in +iptable_mangle_net_exit() and
    called when ip_mangle modules is unloaded. A rmmod task was found running
    in the crash dump.  A 2nd crash showed the same problem when running
    "rmmod iptable_filter" (net->ipv4.iptable_filter=NULL).

    To fix this I added .pre_exit hook in all iptable_foo.c. The pre_exit will
    un-register the underlying hook and exit would do the table freeing. The
    netns core does an unconditional +synchronize_rcu after the pre_exit hooks
    insuring no packets are in flight that have picked up the pointer before
    completing the un-register.

    These patches include changes for both iptables and ip6tables.

    We tested this fix with ltp running iptables01.sh and iptables01.sh -6 a
    loop for 72 hours.

4) Add a selftest for conntrack helper assignment, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 12:52:41 -07:00
Thomas Martitz
206e732323 net: bridge: enfore alignment for ethernet address
The eth_addr member is passed to ether_addr functions that require
2-byte alignment, therefore the member must be properly aligned
to avoid unaligned accesses.

The problem is in place since the initial merge of multicast to unicast:
commit 6db6f0eae6 bridge: multicast to unicast

Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Thomas Martitz <t.martitz@avm.de>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 12:38:16 -07:00
Denis Kirjanov
2570284060 tcp: don't ignore ECN CWR on pure ACK
there is a problem with the CWR flag set in an incoming ACK segment
and it leads to the situation when the ECE flag is latched forever

the following packetdrill script shows what happens:

// Stack receives incoming segments with CE set
+0.1 <[ect0]  . 11001:12001(1000) ack 1001 win 65535
+0.0 <[ce]    . 12001:13001(1000) ack 1001 win 65535
+0.0 <[ect0] P. 13001:14001(1000) ack 1001 win 65535

// Stack repsonds with ECN ECHO
+0.0 >[noecn]  . 1001:1001(0) ack 12001
+0.0 >[noecn] E. 1001:1001(0) ack 13001
+0.0 >[noecn] E. 1001:1001(0) ack 14001

// Write a packet
+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0] PE. 1001:2001(1000) ack 14001

// Pure ACK received
+0.01 <[noecn] W. 14001:14001(0) ack 2001 win 65535

// Since CWR was sent, this packet should NOT have ECE set

+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0]  P. 2001:3001(1000) ack 14001
// but Linux will still keep ECE latched here, with packetdrill
// flagging a missing ECE flag, expecting
// >[ect0] PE. 2001:3001(1000) ack 14001
// in the script

In the situation above we will continue to send ECN ECHO packets
and trigger the peer to reduce the congestion window. To avoid that
we can check CWR on pure ACKs received.

v3:
- Add a sequence check to avoid sending an ACK to an ACK

v2:
- Adjusted the comment
- move CWR check before checking for unacknowledged packets

Signed-off-by: Denis Kirjanov <denis.kirjanov@suse.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 12:20:24 -07:00
Markus Theil
0b467b6387 mac80211: allow rx of mesh eapol frames with default rx key
Without this patch, eapol frames cannot be received in mesh
mode, when 802.1X should be used. Initially only a MGTK is
defined, which is found and set as rx->key, when there are
no other keys set. ieee80211_drop_unencrypted would then
drop these eapol frames, as they are data frames without
encryption and there exists some rx->key.

Fix this by differentiating between mesh eapol frames and
other data frames with existing rx->key. Allow mesh mesh
eapol frames only if they are for our vif address.

With this patch in-place, ieee80211_rx_h_mesh_fwding continues
after the ieee80211_drop_unencrypted check and notices, that
these eapol frames have to be delivered locally, as they should.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200625104214.50319-1-markus.theil@tu-ilmenau.de
[small code cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-25 12:55:45 +02:00
Markus Theil
5af7fef39d mac80211: skip mpath lookup also for control port tx
When using 802.1X over mesh networks, at first an ordinary
mesh peering is established, then the 802.1X EAPOL dialog
happens, afterwards an authenticated mesh peering exchange
(AMPE) happens, finally the peering is complete and we can
set the STA authorized flag.

As 802.1X is an intermediate step here and key material is
not yet exchanged for stations we have to skip mesh path lookup
for these EAPOL frames. Otherwise the already configure mesh
group encryption key would be used to send a mesh path request
which no one can decipher, because we didn't already establish
key material on both peers, like with SAE and directly using AMPE.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200617082637.22670-2-markus.theil@tu-ilmenau.de
[remove pointless braces, remove unnecessary local variable,
 the list can only process one such frame (or its fragments)]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-25 10:59:27 +02:00
Seevalamuthu Mariappan
78fb5b541b mac80211: Fix dropping broadcast packets in 802.11 encap
Broadcast pkts like arp are getting dropped in 'ieee80211_8023_xmit'.
Fix this by replacing is_valid_ether_addr api with is_zero_ether_addr.

Fixes: 50ff477a86 ("mac80211: add 802.11 encapsulation offloading support")
Signed-off-by: Seevalamuthu Mariappan <seevalam@codeaurora.org>
Link: https://lore.kernel.org/r/1591697754-4975-1-git-send-email-seevalam@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-25 10:54:35 +02:00
Pavel Machek
01da2e059d mac80211: simplify mesh code
Doing mod_timer() conditionaly is easier than conditionally unlocking
and jumping around...

Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Acked-by: Linus Lüssing <ll@simonwunderlich.de>
Link: https://lore.kernel.org/r/20200604214157.GA9737@amd
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-25 10:54:09 +02:00
Markus Theil
86a1b9d7c2 mac80211: fix control port tx status check
The initial control port tx status patch assumed, that
we have IEEE 802.11 frames, but actually ethernet frames
are stored in the ack skb. Fix this by checking for the
correct ethertype and skb protocol 802.3.

Also allow tx status reports for ETH_P_PREAUTH, as preauth
frames can also be send over the nl80211 control port.

Fixes: a7528198ad ("mac80211: support control port TX status reporting")
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20200622123542.173695-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-25 10:48:09 +02:00
Po Liu
627e39b139 net: qos: police action add index for tc flower offloading
Hardware device may include more than one police entry. Specifying the
action's index make it possible for several tc filters to share the same
police action when installing the filters.

Propagate this index to device drivers through the flow offload
intermediate representation, so that drivers could share a single
hardware policer between multiple filters.

v1->v2 changes:
- Update the commit message suggest by Ido Schimmel <idosch@idosch.org>

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-24 22:04:26 -07:00
Po Liu
19e528dc9a net: qos: add tc police offloading action with max frame size limit
Current police offloading support the 'burst'' and 'rate_bytes_ps'. Some
hardware own the capability to limit the frame size. If the frame size
larger than the setting, the frame would be dropped. For the police
action itself already accept the 'mtu' parameter in tc command. But not
extend to tc flower offloading. So extend 'mtu' to tc flower offloading.

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-24 22:04:26 -07:00