Commit Graph

902017 Commits

Author SHA1 Message Date
Danielle Ratson
1cbe65e09b selftests: mlxsw: Use busywait helper in rtnetlink test
Rtnetlink test uses offload indication checks.

Use a busywait helper and wait until the offload indication is set or
fail if it reaches timeout.

Signed-off-by: Danielle Ratson <danieller@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Danielle Ratson
05ef614c55 selftests: mlxsw: Use busywait helper in vxlan test
Vxlan test uses offload indication checks.

Use a busywait helper and wait until the offload indication is set or
fail if it reaches timeout.

Signed-off-by: Danielle Ratson <danieller@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Danielle Ratson
0c22f993c9 selftests: mlxsw: Use busywait helper in blackhole routes test
Blackhole routes test uses offload indication checks.

Use busywait helper and wait until the routes offload indication is set or
fail if it reaches timeout.

Signed-off-by: Danielle Ratson <danieller@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Ido Schimmel
5d66773f41 selftests: devlink_trap_l3_drops: Avoid race condition
The test checks that packets are trapped when they should egress a
router interface (RIF) that has become disabled. This is a temporary
state in a RIF's deletion sequence.

Currently, the test deletes the RIF by flushing all the IP addresses
configured on the associated netdev (br0). However, this is racy, as
this also flushes all the routes pointing to the netdev and if the
routes are deleted from the device before the RIF is disabled, then no
packets will try to egress the disabled RIF and the trap will not be
triggered.

Instead, trigger the deletion of the RIF by unlinking the mlxsw port
from the bridge that is backing the RIF. Unlike before, this will not
cause the kernel to delete the routes pointing to the bridge.

Note that due to current mlxsw locking scheme the RIF is always deleted
first, but this is going to change.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Jiri Pirko
ab2b8ab253 selftests: add a mirror test to mlxsw tc flower restrictions
Include test of forbidding to have multiple mirror actions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Jiri Pirko
c84e903f62 selftests: add egress redirect test to mlxsw tc flower restrictions
Include test of forbidding to have redirect rule on egress-bound block.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Petr Machata
3de611b507 selftests: mlxsw: Add a RED selftest
This tests that below the queue minimum length, there is no dropping /
marking, and above max, everything is dropped / marked.

The test is structured as a core file with topology and test code, and
three wrappers: one for RED used as a root Qdisc, and two for
testing (W)RED under PRIO and ETS.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
Petr Machata
4113b04823 selftests: forwarding: lib.sh: Add start_tcp_traffic
Extract a helper __start_traffic() configurable by protocol type. Allow
passing through extra mausezahn arguments. Add a wrapper,
start_tcp_traffic().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:10:14 -08:00
David S. Miller
f4979b41f3 Merge branch 'hinic-BugFixes'
Luo bin says:

====================
hinic: BugFixes

the bug fixed in patch #2 has been present since the first commit.
the bugs fixed in patch #1 and patch #3 have been present since the
following commits:
patch #1: 352f58b0d9 ("net-next/hinic: Set Rxq irq to specific cpu for NUMA")
patch #3: 421e952628 ("hinic: add rss support")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:08:01 -08:00
Luo bin
386d4716fd hinic: fix a bug of rss configuration
should use real receive queue number to configure hw rss
indirect table rather than maximal queue number

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:08:01 -08:00
Luo bin
d2ed69ce9e hinic: fix a bug of setting hw_ioctxt
a reserved field is used to signify prime physical function index
in the latest firmware version, so we must assign a value to it
correctly

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:08:01 -08:00
Luo bin
0bff777bd0 hinic: fix a irq affinity bug
can not use a local variable as an input parameter of
irq_set_affinity_hint

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:08:01 -08:00
Linus Torvalds
e46bfaba59 A pair of docs-build fixes.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl5U9RsACgkQF0NaE2wM
 fljetAgAkIzf4eaENcrU8LUOjR7p66a6IIBLA5WAlSc2CvmfHrI6fJgxUWHZKxZp
 0NhzoNiYo0zPkrPpjabpWLwyiIAIDR3RA7B+dOuxCW0AieLuPV6ltg3ytXukeGvo
 ZOVXEVgorG6Kx4oNvCKqrUVPICU+SErSRmREVggLWi4iM1cQzUxcMIBJEw8SgHEV
 mDT5igcfSqWWxU8rF6zGY0zV+l+pN12gf7EfmKnJ6E0oCWNZ6VHbmyHaVxIS76C7
 9aiNW5+ATeCPEYvXseAaOSCbJWOU0wlxrskpHSN2yBgGHTGipEEFZF5rAj8QbeYH
 9UPqvmgKFLZqQ3aCqAK3jltm8ojyRA==
 =rH5d
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.6-fixes' of git://git.lwn.net/linux

Pull documentation fixes from Jonathan Corbet:
 "A pair of docs-build fixes"

* tag 'docs-5.6-fixes' of git://git.lwn.net/linux:
  docs: Fix empty parallelism argument
  docs: remove MPX from the x86 toc
2020-02-27 11:07:13 -08:00
Linus Torvalds
ed5fa55918 audit/stable-5.6 PR 20200226
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl5XHiYUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOn4w/9EWM7O3ZNr1BJb/p6Z1DxAHi1e11s
 UC5G6njvfihvddS1dhT00eSUX/Bk9WrCLNX5mzXbi5SD8iPP1bDk2L/1xFuR/Dji
 DHOruYsCqBI6t/rmEhwJiAaMWWQnGzHVfL+Fh7cOBhk2GNdfLUSs+BPXrRmchtH5
 OlOogvpMCigK4pvYFN1AXNpvHnBYTvWrmC32KIjj9WcyGuY9Uz0TI+AOnKRJAp/Q
 YKV7vyw6ZpEt2/nT9lQHTWrMrsXLIybJLRF+BX9pVsz3bzi3u5FqROF1FntIOv1R
 OF5cI7ly9ixY2VMyxh/0k2X5aB/6BqyB2kmk58x0Wbqxcwi+rOqnMtm5chwpjQ+u
 bWN+mcB8XGe2Rj+/jjLjoiZReX6CnMuXZe+JNNYMJ+gV6x/Y9aCfg/G5K2ZvXdE2
 Cc1mkhGACivLDlYxgpoELR5UC7KEQ2rlYenP6nYiNEsqyMCYPw6g0Rac4dvjppYv
 YYOp0xyZuK9U+aCGg6Tk2UoOgHRSsG7HUK3yn3Zy8ol5A5NtpOS+8wyyT1e3B/iX
 xjQfaGx6P66r9qhs6ikVMbOcx9ZZxJXERiA6pNjlGcX4HyZdOokuP946NSYJZ4TK
 EUA9f4KcKIyBrDMFne2r1O8fNLybkU+2/HkFdak8eMTVtJx0l9oPEAZV/6TOReBv
 Smf0PXFLTv/oRzc=
 =DyVs
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fixes from Paul Moore:
 "Two fixes for problems found by syzbot:

   - Moving audit filter structure fields into a union caused some
     problems in the code which populates that filter structure.

     We keep the union (that idea is a good one), but we are fixing the
     code so that it doesn't needlessly set fields in the union and mess
     up the error handling.

   - The audit_receive_msg() function wasn't validating user input as
     well as it should in all cases, we add the necessary checks"

* tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: always check the netlink payload length in audit_receive_msg()
  audit: fix error handling in audit_data_to_entry()
2020-02-27 11:01:22 -08:00
David S. Miller
2b99e54b30 Merge branch 'VLANs-DSA-switches-and-multiple-bridges'
Russell King says:

====================
VLANs, DSA switches and multiple bridges

This is a repost of the previously posted RFC back in December, which
did not get fully reviewed.  I've dropped the RFC tag this time as no
one really found anything too problematical in the RFC posting.

I've been trying to configure DSA for VLANs and not having much success.
The setup is quite simple:

- The main network is untagged
- The wifi network is a vlan tagged with id $VN running over the main
  network.

I have an Armada 388 Clearfog with a PCIe wifi card which I'm trying to
setup to provide wifi access to the vlan $VN network, while the switch
is also part of the main network.

However, I'm encountering problems:

1) vlan support in DSA has a different behaviour from the Linux
   software bridge implementation.

    # bridge vlan
    port    vlan ids
    lan1     1 PVID Egress Untagged
    ...

   shows the default setup - the bridge ports are all configured for
   vlan 1, untagged egress, and vlan 1 as the port vid.  Issuing:

    # ip li set dev br0 type bridge vlan_filtering 1

   with no other vlan configuration commands on a Linux software bridge
   continues to allow untagged traffic to flow across the bridge.

   This difference in behaviour is because the MV88E6xxx VTU is
   completely empty - because net/dsa ignores all vlan settings for
   a port if br_vlan_enabled(dp->bridge_dev) is false - this reflects
   the vlan filtering state of the bridge, not whether the bridge is
   vlan aware.

   What this means is that attempting to configure the bridge port
   vlans before enabling vlan filtering works for Linux software
   bridges, but fails for DSA bridges.

2) Assuming the above is sorted, we move on to the next issue, which
   is altogether more weird.  Let's take a setup where we have a
   DSA bridge with lan1..6 in a bridge device, br0, with vlan
   filtering enabled.  lan1 is the upstream port, lan2 is a downstream
   port that also wants to see traffic on vlan id $VN.

   Both lan1 and lan2 are configured for that:

     # bridge vlan add vid $VN dev lan1
     # bridge vlan add vid $VN dev lan2
     # ip li set br0 type bridge vlan_filtering 1

   Untagged traffic can now pass between all the six lan ports, and
   vlan $VN between lan1 and lan2 only.  The MV88E6xxx 8021q_mode
   debugfs file shows all lan ports are in mode "secure" - this is
   important!  /sys/class/net/br0/bridge/vlan_filtering contains 1.

   tcpdumping from another machine on lan4 shows that no $VN traffic
   reaches it.  Everything seems to be working correctly...

   In order to further bridge vlan $VN traffic to hostapd's wifi
   interface, things get a little more complex - we can't add hostapd's
   wifi interface to br0 directly, because hostapd will bring up the
   wifi interface and leak the main, untagged traffic onto the wifi.
   (hostapd does have vlan support, but only as a dynamic per-client
   thing, and there's no hooks I can see to allow script-based config
   of the network setup before hostapd up's the wifi interface.)

   So, what I tried was:

     # ip li add link br0 name br0.$VN type vlan id $VN
     # bridge vlan add vid $VN dev br0 self
     # ip li set dev br0.$VN up

   So far so good, we get a vlan interface on top of the bridge, and
   tcpdumping it shows we get traffic.  The 8021q_mode file has not
   changed state.  Everything still seems to be correct.

     # bridge addbr br1

   Still nothing has changed.

     # bridge addif br1 br0.$VN

   And now the 8021q_mode debugfs file shows that all ports are now in
   "disabled" mode, but /sys/class/net/br0/bridge/vlan_filtering still
   contains '1'.  In other words, br0 still thinks vlan filtering is
   enabled, but the hardware has had vlan filtering disabled.

   Adding some stack traces to an appropriate point indicates that this
   is because __switchdev_handle_port_attr_set() recurses down through
   the tree of interfaces, skipping over the vlan interface, applying
   br1's configuration to br0's ports.

   This surely can not be right - surely
   __switchdev_handle_port_attr_set() and similar should stop recursing
   down through another master bridge device?  There are probably other
   network device classes that switchdev shouldn't recurse down too.

   I've considered whether switchdev is the right level to do it, and
   I think it is - as we want the check/set callbacks to be called for
   the top level device even if it is a master bridge device, but we
   don't want to recurse through a lower master bridge device.

v2: dropped patch 3, since that has an outstanding issue, and my
question on it has not been answered.  Otherwise, these are the
same patches.  Maybe we can move forward with just these two?

v3: include DSA ports in patch 2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:58:33 -08:00
Russell King
933b442508 net: dsa: mv88e6xxx: fix duplicate vlan warning
When setting VLANs on DSA switches, the VLAN is added to both the port
concerned as well as the CPU port by dsa_slave_vlan_add(), as well as
any DSA ports.  If multiple ports are configured with the same VLAN ID,
this triggers a warning on the CPU and DSA ports.

Avoid this warning for CPU and DSA ports.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:58:33 -08:00
Russell King
07c6f9805f net: switchdev: do not propagate bridge updates across bridges
When configuring a tree of independent bridges, propagating changes
from the upper bridge across a bridge master to the lower bridge
ports brings surprises.

For example, a lower bridge may have vlan filtering enabled.  It
may have a vlan interface attached to the bridge master, which may
then be incorporated into another bridge.  As soon as the lower
bridge vlan interface is attached to the upper bridge, the lower
bridge has vlan filtering disabled.

This occurs because switchdev recursively applies its changes to
all lower devices no matter what.

Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:58:33 -08:00
Karsten Graul
a2f2ef4a54 net/smc: check for valid ib_client_data
In smc_ib_remove_dev() check if the provided ib device was actually
initialized for SMC before.

Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com
Fixes: a4cf0443c4 ("smc: introduce SMC as an IB-client")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:56:25 -08:00
Aaro Koskinen
474a31e13a net: stmmac: fix notifier registration
We cannot register the same netdev notifier multiple times when probing
stmmac devices. Register the notifier only once in module init, and also
make debugfs creation/deletion safe against simultaneous notifier call.

Fixes: 481a7d154c ("stmmac: debugfs entry name is not be changed when udev rename device name.")
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:55:14 -08:00
Antoine Tenart
c87a9d6fc6 net: phy: mscc: fix firmware paths
The firmware paths for the VSC8584 PHYs not not contain the leading
'microchip/' directory, as used in linux-firmware, resulting in an
error when probing the driver. This patch fixes it.

Fixes: a5afc16780 ("net: phy: mscc: add support for VSC8584 PHY")
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:53:09 -08:00
Dan Carpenter
9baeea5071 net: qrtr: Fix error pointer vs NULL bugs
The callers only expect NULL pointers, so returning an error pointer
will lead to an Oops.

Fixes: 0c2204a4ad ("net: qrtr: Migrate nameservice to kernel from userspace")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:52:31 -08:00
Antoine Tenart
1ac7b090ec net: phy: mscc: add missing shift for media operation mode selection
This patch adds a missing shift for the media operation mode selection.
This does not fix the driver as the current operation mode (copper) has
a value of 0, but this wouldn't work for other modes.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:51:42 -08:00
Paolo Abeni
dc24f8b4ec mptcp: add dummy icsk_sync_mss()
syzbot noted that the master MPTCP socket lacks the icsk_sync_mss
callback, and was able to trigger a null pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246
RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600
RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140
RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a
R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0
R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba
FS:  0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888
 netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989
 smack_netlabel security/smack/smack_lsm.c:2425 [inline]
 smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716
 security_inode_setsecurity+0xb2/0x140 security/security.c:1364
 __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197
 vfs_setxattr fs/xattr.c:224 [inline]
 setxattr+0x335/0x430 fs/xattr.c:451
 __do_sys_fsetxattr fs/xattr.c:506 [inline]
 __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495
 __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495
 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440199
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199
RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8
R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20
R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
CR2: 0000000000000000

Address the issue adding a dummy icsk_sync_mss callback.
To properly sync the subflows mss and options list we need some
additional infrastructure, which will land to net-next.

Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com
Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:49:50 -08:00
Arthur Kiyanovski
92040c6daa net: ena: fix broken interface between ENA driver and FW
In this commit we revert the part of
commit 1a63443afd ("net/amazon: Ensure that driver version is aligned to the linux kernel"),
which breaks the interface between the ENA driver and FW.

We also replace the use of DRIVER_VERSION with DRIVER_GENERATION
when we bring back the deleted constants that are used in interface with
ENA device FW.

This commit does not change the driver version reported to the user via
ethtool, which remains the kernel version.

Fixes: 1a63443afd ("net/amazon: Ensure that driver version is aligned to the linux kernel")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:47:58 -08:00
David S. Miller
621135a0f9 Merge branch 'mptcp-update-mptcp-ack-sequence-outside-of-recv-path'
Florian Westphal says:

====================
mptcp: update mptcp ack sequence outside of recv path

This series moves mptcp-level ack sequence update outside of the recvmsg path.
Current approach has two problems:

1. There is delay between arrival of new data and the time we can ack
   this data.
2. If userspace doesn't call recv for some time, mptcp ack_seq is not
   updated at all, even if this data is queued in the subflow socket
   receive queue.

Move skbs from the subflow socket receive queue to the mptcp-level
receive queue, updating the mptcp-level ack sequence and have recv
take skbs from the mptcp-level receive queue.

The first place where we will attempt to update the mptcp level acks
is from the subflows' data_ready callback, even before we make userspace
aware of new data.

Because of possible deadlock (we need to take the mptcp socket lock
while already holding the subflow sockets lock), we may still need to
defer the mptcp-level ack update.  In such case, this work will be either
done from work queue or recv path, depending on which runs sooner.

In order to avoid pointless scheduling of the work queue, work
will be queued from the mptcp sockets lock release callback.
This allows to detect when the socket owner did drain the subflow
socket receive queue.

Please see individual patches for more information.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Paolo Abeni
14c441b564 mptcp: defer work schedule until mptcp lock is released
Don't schedule the work queue right away, instead defer this
to the lock release callback.

This has the advantage that it will give recv path a chance to
complete -- this might have moved all pending packets from the
subflow to the mptcp receive queue, which allows to avoid the
schedule_work().

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
2e52213c79 mptcp: avoid work queue scheduling if possible
We can't lock_sock() the mptcp socket from the subflow data_ready callback,
it would result in ABBA deadlock with the subflow socket lock.

We can however grab the spinlock: if that succeeds and the mptcp socket
is not owned at the moment, we can process the new skbs right away
without deferring this to the work queue.

This avoids the schedule_work and hence the small delay until the
work item is processed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
bfae9dae44 mptcp: remove mptcp_read_actor
Only used to discard stale data from the subflow, so move
it where needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
600911ff5f mptcp: add rmem queue accounting
If userspace never drains the receive buffers we must stop draining
the subflow socket(s) at some point.

This adds the needed rmem accouting for this.
If the threshold is reached, we stop draining the subflows.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
6771bfd9ee mptcp: update mptcp ack sequence from work queue
If userspace is not reading data, all the mptcp-level acks contain the
ack_seq from the last time userspace read data rather than the most
recent in-sequence value.

This causes pointless retransmissions for data that is already queued.

The reason for this is that all the mptcp protocol level processing
happens at mptcp_recv time.

This adds work queue to move skbs from the subflow sockets receive
queue on the mptcp socket receive queue (which was not used so far).

This allows us to announce the correct mptcp ack sequence in a timely
fashion, even when the application does not call recv() on the mptcp socket
for some time.

We still wake userspace tasks waiting for POLLIN immediately:
If the mptcp level receive queue is empty (because the work queue is
still pending) it can be filled from in-sequence subflow sockets at
recv time without a need to wait for the worker.

The skb_orphan when moving skbs from subflow to mptcp level is needed,
because the destructor (sock_rfree) relies on skb->sk (ssk!) lock
being taken.

A followup patch will add needed rmem accouting for the moved skbs.

Other problem: In case application behaves as expected, and calls
recv() as soon as mptcp socket becomes readable, the work queue will
only waste cpu cycles.  This will also be addressed in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Paolo Abeni
8099201715 mptcp: add work queue skeleton
Will be extended with functionality in followup patches.
Initial user is moving skbs from subflows receive queue to
the mptcp-level receive queue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
101f6f851e mptcp: add and use mptcp_data_ready helper
allows us to schedule the work queue to drain the ssk receive queue in
a followup patch.

This is needed to avoid sending all-to-pessimistic mptcp-level
acknowledgements.  At this time, the ack_seq is what was last read by
userspace instead of the highest in-sequence number queued for reading.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
David S. Miller
5cd129dd5e Merge branch 'mlxsw-Small-driver-update'
Jiri Pirko says:

====================
mlxsw: Small driver update

This patchset contains couple of patches not related to each other. They
are small optimization and extension changes to the driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:44:42 -08:00
Petr Machata
3b909c552a mlxsw: spectrum: Add mlxsw_sp_span_ops.buffsize_get for Spectrum-3
The buffer factor on Spectrum-3 is larger than on Spectrum-2. Add a new
callback and use it for mlxsw_sp->span_ops on Spectrum-3.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:44:42 -08:00
Ido Schimmel
b401ff8541 mlxsw: spectrum: Initialize advertised speeds to supported speeds
During port initialization the driver instructs the device to only
advertise speeds that can be supported by the port's current width.

Since the device now returns the supported speeds based on the port's
current width, the driver no longer needs to compute the speeds that can
be advertised.

Simplify port initialization by setting the advertised speeds to the
queried supported speeds.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:44:42 -08:00
Petr Machata
8a29581eb0 mlxsw: spectrum: Move the ECN-marked packet counter to ethtool
Spectrum-1 and Spectrum-2 do not have a per-TC counter of number of packets
marked by ECN. The value reported currently is the total number of marked
packets. Showing this value at individual TC Qdiscs is misleading.

Move the counter to ethtool instead.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:44:42 -08:00
Jiri Pirko
648e53cac7 mlxsw: spectrum_switchdev: Optimize SFN records processing
Currently, only one SFN query is done from repetitive work at a time,
processing 64 entries. Another work iteration is scheduled in 100ms,
that means that the max rate of learned FDB entries is limited to 6400/s.
That is slow. Fix this by doing 2 optimizations:
1) Run 10 SFN queries at a time.
2) In case the SFN is not drained, schedule work with 0 delay to allow
   to continue processing rest of the records.

On a testing setup with 500K entries the time to process decreased
from 870secs to 10secs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:44:42 -08:00
Sudheesh Mavila
4f31c532ad net: phy: corrected the return value for genphy_check_and_restart_aneg and genphy_c45_check_and_restart_aneg
When auto-negotiation is not required, return value should be zero.

Changes v1->v2:
- improved comments and code as Andrew Lunn and Heiner Kallweit suggestion
- fixed issue in genphy_c45_check_and_restart_aneg as Russell King
  suggestion.

Fixes: 2a10ab043a ("net: phy: add genphy_check_and_restart_aneg()")
Fixes: 1af9f16840 ("net: phy: add genphy_c45_check_and_restart_aneg()")
Signed-off-by: Sudheesh Mavila <sudheesh.mavila@amd.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:41:42 -08:00
Randy Dunlap
c535f92032 af_llc: fix if-statement empty body warning
When debugging via dprintk() is not enabled, make the dprintk()
macro be an empty do-while loop, as is done in
<linux/sunrpc/debug.h>.

This fixes a gcc warning when -Wextra is set:
../net/llc/af_llc.c:974:51: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]

I have verified that there is not object code change (with gcc 7.5.0).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:38:13 -08:00
yangerkun
f596c87005 slip: not call free_netdev before rtnl_unlock in slip_open
As the description before netdev_run_todo, we cannot call free_netdev
before rtnl_unlock, fix it by reorder the code.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:37:06 -08:00
David S. Miller
165b94ffcf mlx5-updates-2020-02-25
The following series provides some misc updates to mlx5 driver:
 
 1) From Maxim, Refactoring for mlx5e netdev channels recreation flow.
   - Add error handling
   - Add context to the preactivate hook
   - Use preactivate hook with context where it can be used
     and subsequently unify channel recreation flow everywhere.
   - Fix XPS cpumask to not reset upon channel recreation.
 
 2) From Tariq:
   - Use indirect calls wrapper on RX.
   - Check LRO capability bit
 
 3) Multiple small cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl5VxJAACgkQSD+KveBX
 +j5q6gf+NPVz9uaWlegeAD0J/AkrycIOlceaBhwQ7zDJ1u2qy8Aan6QuprFLKwkW
 hHDfyqxMXdsHnQJAfOiUU1HGgK2v092BWoArdHMJwift333TQkRzJNXCM9WBKAOU
 PpyOgvRESLWOjcgOcm3AYv/FKjiW5akm7rlW+jYy3JswjY4a7rqJvVKUnXXPP9B1
 5oiis1OIQVz9y2GTEOYwBx3euZ87TCqzsgF+qNkKtFil74gOnVHNx4kqFTLE3b0Z
 awq+dvKOUlCIobAX69BmKDj6ieAKNO7l0mfMYnmbVk8nNt33F0z94MBIw2ztCcPq
 AN9hDZdQitnQFH9okdVCf4rKxaQoxA==
 =lixf
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2020-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-02-25

The following series provides some misc updates to mlx5 driver:

1) From Maxim, Refactoring for mlx5e netdev channels recreation flow.
  - Add error handling
  - Add context to the preactivate hook
  - Use preactivate hook with context where it can be used
    and subsequently unify channel recreation flow everywhere.
  - Fix XPS cpumask to not reset upon channel recreation.

2) From Tariq:
  - Use indirect calls wrapper on RX.
  - Check LRO capability bit

3) Multiple small cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:30:43 -08:00
David S. Miller
06baf4be20 Merge branch 'net-smc-improve-peer-ID-in-CLC-decline'
Hans Wippel says:

====================
net/smc: improve peer ID in CLC decline

The following two patches improve the peer ID in CLC decline messages if
RoCE devices are present in the host but no suitable device is found for
a connection. The first patch reworks the peer ID initialization. The
second patch contains the actual changes of the CLC decline messages.

Changes v1 -> v2:
* make smc_ib_is_valid_local_systemid() static in first patch
* changed if in smc_clc_send_decline() to remove curly braces

Changes RFC -> v1:
* split the patch into two parts
* removed zero assignment to global variable (thanks Leon)

Thanks to Leon Romanovsky and Karsten Graul for the feedback!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:27:06 -08:00
Hans Wippel
a082ec897f net/smc: improve peer ID in CLC decline for SMC-R
According to RFC 7609, all CLC messages contain a peer ID that consists
of a unique instance ID and the MAC address of one of the host's RoCE
devices. But if a SMC-R connection cannot be established, e.g., because
no matching pnet table entry is found, the current implementation uses a
zero value in the CLC decline message although the host's peer ID is set
to a proper value.

If no RoCE and no ISM device is usable for a connection, there is no LGR
and the LGR check in smc_clc_send_decline() prevents that the peer ID is
copied into the CLC decline message for both SMC-D and SMC-R. So, this
patch modifies the check to also accept the case of no LGR. Also, only a
valid peer ID is copied into the decline message.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:27:06 -08:00
Hans Wippel
366bb249b5 net/smc: rework peer ID handling
This patch initializes the peer ID to a random instance ID and a zero
MAC address. If a RoCE device is in the host, the MAC address part of
the peer ID is overwritten with the respective address. Also, a function
for checking if the peer ID is valid is added. A peer ID is considered
valid if the MAC address part contains a non-zero MAC address.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:27:06 -08:00
Arjun Roy
0b7f41f687 tcp-zerocopy: Update returned getsockopt() optlen.
TCP receive zerocopy currently does not update the returned optlen for
getsockopt() if the user passed in a larger than expected value.
Thus, userspace cannot properly determine if all the fields are set in
the passed-in struct. This patch sets the optlen for this case before
returning, in keeping with the expected operation of getsockopt().

Fixes: c8856c0514 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:24:22 -08:00
Eric Dumazet
b6f6118901 ipv6: restrict IPV6_ADDRFORM operation
IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one.
While this operation sounds illogical, we have to support it.

One of the things it does for TCP socket is to switch sk->sk_prot
to tcp_prot.

We now have other layers playing with sk->sk_prot, so we should make
sure to not interfere with them.

This patch makes sure sk_prot is the default pointer for TCP IPv6 socket.

syzbot reported :
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD a0113067 P4D a0113067 PUD a8771067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427
 __sock_release net/socket.c:605 [inline]
 sock_close+0xe1/0x260 net/socket.c:1283
 __fput+0x2e4/0x740 fs/file_table.c:280
 ____fput+0x15/0x20 fs/file_table.c:313
 task_work_run+0x176/0x1b0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:164 [inline]
 prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195
 syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278
 do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45c429
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429
RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004
RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c
Modules linked in:
CR2: 0000000000000000
---[ end trace 82567b5207e87bae ]---
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:20:58 -08:00
Ursula Braun
51e3dfa890 net/smc: fix cleanup for linkgroup setup failures
If an SMC connection to a certain peer is setup the first time,
a new linkgroup is created. In case of setup failures, such a
linkgroup is unusable and should disappear. As a first step the
linkgroup is removed from the linkgroup list in smc_lgr_forget().

There are 2 problems:
smc_listen_decline() might be called before linkgroup creation
resulting in a crash due to calling smc_lgr_forget() with
parameter NULL.
If a setup failure occurs after linkgroup creation, the connection
is never unregistered from the linkgroup, preventing linkgroup
freeing.

This patch introduces an enhanced smc_lgr_cleanup_early() function
which
* contains a linkgroup check for early smc_listen_decline()
  invocations
* invokes smc_conn_free() to guarantee unregistering of the
  connection.
* schedules fast linkgroup removal of the unusable linkgroup

And the unused function smcd_conn_free() is removed from smc_core.h.

Fixes: 3b2dec2603 ("net/smc: restructure client and server code in af_smc")
Fixes: 2a0674fffb ("net/smc: improve abnormal termination of link groups")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:18:07 -08:00
David S. Miller
ebb4a4bf76 Merge branch 'net-fix-sysfs-permssions-when-device-changes-network'
Christian Brauner says:

====================
net: fix sysfs permssions when device changes network

/* v7 */
This is v7 with a build warning fixup that slipped past me the last
time. It removes to unused variables in sysfs_group_change_owner(). I
observed no warning when building just now.

/* v6 */
This is v6 with two small fixups. I missed adapting the commit message
to reflect the renamed helper for changing the owner of sysfs files and
I also forgot to make the new dpm helper static inline.

/* v5 */
This is v5 with a small fixup requested by Rafael.

/* v4 */
This is v4 with more documentation and other fixes that Greg requested.

/* v3 */
This is v3 with explicit uid and gid parameters added to functions that
change sysfs object ownership as Greg requested.

(I've tagged this with net-next since it's triggered by a bug for
 network device files but it also touches driver core aspects so it's
 not clear-cut. I can of course split this series into separate
 patchsets.)
We have been struggling with a bug surrounding the ownership of network
device sysfs files when moving network devices between network
namespaces owned by different user namespaces reported by multiple
users.

Currently, when moving network devices between network namespaces the
ownership of the corresponding sysfs entries is not changed. This leads
to problems when tools try to operate on the corresponding sysfs files.

I also causes a bug when creating a network device in a network
namespaces owned by a user namespace and moving that network device back
to the host network namespaces. Because when a network device is created
in a network namespaces it will be owned by the root user of the user
namespace and all its associated sysfs files will also be owned by the
root user of the corresponding user namespace.
If such a network device has to be moved back to the host network
namespace the permissions will still be set to the root user of the
owning user namespaces of the originating network namespace. This means
unprivileged users can e.g. re-trigger uevents for such incorrectly
owned devices on the host or in other network namespaces. They can also
modify the settings of the device itself through sysfs when they
wouldn't be able to do the same through netlink. Both of these things
are unwanted.

For example, quite a few workloads will create network devices in the
host network namespace. Other tools will then proceed to move such
devices between network namespaces owner by other user namespaces. While
the ownership of the device itself is updated in
net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs
entry for the device is not. Below you'll find that moving a network
device (here a veth device) from a network namespace into another
network namespaces owned by a different user namespace with a different
id mapping. As you can see the permissions are wrong even though it is
owned by the userns root user after it has been moved and can be
interacted with through netlink:

drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .
drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 power
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
drwxr-xr-x 4 nobody nobody    0 Jan 25 18:09 queues
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 statistics
lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:08 subsystem -> ../../../../class/net
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent

Constrast this with creating a device of the same type in the network
namespace directly. In this case the device's sysfs permissions will be
correctly updated.
(Please also note, that in a lot of workloads this strategy of creating
 the network device directly in the network device to workaround this
 issue can not be used. Either because the network device is dedicated
 after it has been created or because it used by a process that is
 heavily sandboxed and couldn't create network devices itself.):

drwxr-xr-x 5 root   root      0 Jan 25 18:12 .
drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
-r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_len
-r--r--r-- 1 root   root   4096 Jan 25 18:12 address
-r--r--r-- 1 root   root   4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 carrier
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_id
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_port
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dormant
-r--r--r-- 1 root   root   4096 Jan 25 18:12 duplex
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 flags
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 ifalias
-r--r--r-- 1 root   root   4096 Jan 25 18:12 ifindex
-r--r--r-- 1 root   root   4096 Jan 25 18:12 iflink
-r--r--r-- 1 root   root   4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 mtu
-r--r--r-- 1 root   root   4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 root   root   4096 Jan 25 18:12 operstate
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 root   root      0 Jan 25 18:12 power
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 root   root      0 Jan 25 18:12 queues
-r--r--r-- 1 root   root   4096 Jan 25 18:12 speed
drwxr-xr-x 2 root   root      0 Jan 25 18:12 statistics
lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 root   root   4096 Jan 25 18:12 type
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 uevent

Now, when creating a network device in a network namespace owned by a
user namespace and moving it to the host the permissions will be set to
the id that the user namespace root user has been mapped to on the host
leading to all sorts of permission issues mentioned above:

458752
drwxr-xr-x 5 458752 458752      0 Jan 25 18:12 .
drwxr-xr-x 9 root   root        0 Jan 25 18:08 ..
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_len
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 address
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_id
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_port
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dormant
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 duplex
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 flags
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 ifalias
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 ifindex
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 iflink
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 mtu
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 operstate
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 power
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 458752 458752      0 Jan 25 18:12 queues
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 speed
drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 statistics
lrwxrwxrwx 1 root   root        0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 type
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 uevent

Fix this by changing the basic sysfs files associated with network
devices when moving them between network namespaces. To this end we add
some infrastructure to sysfs.

The patchset takes care to only do this when the owning user namespaces
changes and the kids differ. So there's only a performance overhead,
when the owning user namespace of the network namespace is different
__and__ the kid mappings for the root user are different for the two
user namespaces:
Assume we have a netdev eth0 which we create in netns1 owned by userns1.
userns1 has an id mapping of 0 100000 100000. Now we move eth0 into
netns2 which is owned by userns2 which also defines an id mapping of 0
100000 100000. In this case sysfs doesn't need updating. The patch will
handle this case and not do any needless work. Now assume eth0 is moved
into netns3 which is owned by userns3 which defines an id mapping of 0
123456 65536. In this case the root user in each namespace corresponds
to different kid and sysfs needs updating.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:26 -08:00
Christian Brauner
ef6a4c88e9 net: fix sysfs permssions when device changes network namespace
Now that we moved all the helpers in place and make use netdev_change_owner()
to fixup the permissions when moving network devices between network
namespaces.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:26 -08:00
Christian Brauner
d755407d44 net-sysfs: add queue_change_owner()
Add a function to change the owner of the queue entries for a network device
when it is moved between network namespaces.

Currently, when moving network devices between network namespaces the
ownership of the corresponding queue sysfs entries are not changed. This leads
to problems when tools try to operate on the corresponding sysfs files. Fix
this.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:26 -08:00