Commit Graph

1030205 Commits

Author SHA1 Message Date
Paolo Abeni
f7918b7901 veth: always report zero combined channels
veth get_channel currently reports for channels being both RX/TX and
combined. As Jakub noted:

"""
ethtool man page is relatively clear, unfortunately the kernel code
is not and few read the man page. A channel is approximately an IRQ,
not a queue, and IRQ can't be dedicated and combined simultaneously
"""

This patch changes the information exposed by veth_get_channels,
setting max_combined to zero, being more consistent with the above
statement. The ethtool_channels is always cleared by the caller, we just
need to avoid setting the 'combined' fields.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:11:18 -07:00
Vasily Averin
2c6ad20b58 memcg: enable accounting for scm_fp_list objects
unix sockets allows to send file descriptors via SCM_RIGHTS type messages.
Each such send call forces kernel to allocate up to 2Kb memory for
struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
Vasily Averin
1b51d82719 memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if allocation failed.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. Allocation is temporary and limited by 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
Vasily Averin
a89893dd7b memcg: enable accounting for VLAN group array
vlan array consume up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
Vasily Averin
990c74e3f4 memcg: enable accounting for inet_bin_bucket cache
net namespace can create up to 64K tcp and dccp ports and force kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
Vasily Averin
6126891c6d memcg: enable accounting for IP address and routing-related objects
An netadmin inside container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be manually removed, though usually they lives
in memory till destroy of its net namespace.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One of such objects is the 'struct fib6_node' mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

Obsoleted in_interrupt() does not describe real execution context properly.
>From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
Vasily Averin
c948f51c16 memcg: enable accounting for net_device and Tx/Rx queues
Container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat it again and again.
Net device can request the creation of up to 4096 tx and rx queues,
and force kernel to allocate up to several tens of megabytes memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 06:00:38 -07:00
David S. Miller
2967eed908 Merge branch 'bridge-vlan-multicast'
Nikolay Aleksandrov says:

====================
net: bridge: multicast: add vlan support

This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
vlan disabled checks. We need both flags because one is controlled by
user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
Since the latter is also used for synchronization between the multicast
and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
rely on it when checking if a vlan context is disabled. The multicast
fast-path has 3 new bit tests on the cache-hot bridge flags field, I
didn't observe any measurable difference. I haven't forced either
context options to be always disabled when the other type is enabled
because the state consists of timers which either expire (router) or
don't affect the normal operation. Some options, like the mcast querier
one, won't be allowed to change for the disabled context type, that will
come with a future patch-set which adds per-vlan querier control.

Another important addition is the global vlan options, so far we had
only per bridge/port vlan options but in order to control vlan multicast
snooping globally we need to add a new type of global vlan options.
They can be changed only on the bridge device and are dumped only when a
special flag is set in the dump request. The first global option is vlan
mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
private flag. It can be set only on master vlan entries. There will be
many more global vlan options in the future both for multicast config
and other per-vlan options (e.g. STP).

There's a lot of room for improvements, I'll do some of the initial
ones but splitting the state to different contexts opens the door
for a lot more. Also any new multicast options become vlan-supported with
very little to no effort by using the same contexts.

Short patch description:
  patches 01-04: initial mcast context add, no functional changes
  patch      05: adds vlan mcast init and control helpers and uses them on
                 vlan create/destroy
  patch      06: adds a global bridge mcast vlan snooping knob (default
                 off)
  patches 07-08: add a helper for users which must derive the contexts
                 based on current bridge and vlan options (e.g. timers)
  patch      09: adds checks for disabled vlan contexts in packet
                 processing and timers
  patch      10: adds support for per-vlan querier and tagged queries
  patch      11: adds router port vlan id in the notifications
  patches 12-14: add global vlan options support (change, dump, notify)
  patch      15: adds per-vlan global mcast snooping control

Future patch-sets which build on this one (in order):
 - vlan state mcast handling
 - user-space mdb contexts (currently only the bridge contexts are used
   there)
 - all bridge multicast config options added per-vlan global and per
   vlan/port
 - iproute2 support for all the new uAPIs
 - selftests

This set has been stress-tested (deleting/adding ports/vlans while changing
vlan mcast snooping while processing IGMP/MLD packets), and also has
passed all bridge self-tests. I'm sending this set as early as possible
since there're a few more related sets that should go in the same
release to get proper and full mcast vlan snooping support.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:29 -07:00
Nikolay Aleksandrov
9dee572c38 net: bridge: vlan: add mcast snooping control
Add a new global vlan option which controls whether multicast snooping
is enabled or disabled for a single vlan. It controls the vlan private
flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
9aba624d7c net: bridge: vlan: notify when global options change
Add support for global options notifications. They use only RTM_NEWVLAN
since global options can only be set and are contained in a separate
vlan global options attribute. Notifications are compressed in ranges
where possible, i.e. the sequential vlan options are equal.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
743a53d963 net: bridge: vlan: add support for dumping global vlan options
Add a new vlan options dump flag which causes only global vlan options
to be dumped. The dumps are done only with bridge devices, ports are
ignored. They support vlan compression if the options in sequential
vlans are equal (currently always true).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
47ecd2dbd8 net: bridge: vlan: add support for global options
We can have two types of vlan options depending on context:
 - per-device vlan options (split in per-bridge and per-port)
 - global vlan options

The second type wasn't supported in the bridge until now, but we need
them for per-vlan multicast support, per-vlan STP support and other
options which require global vlan context. They are contained in the global
bridge vlan context even if the vlan is not configured on the bridge device
itself. This patch adds initial netlink attributes and support for setting
these global vlan options, they can only be set (RTM_NEWVLAN) and the
operation must use the bridge device. Since there are no such options yet
it shouldn't have any functional effect.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
1e9ca45662 net: bridge: multicast: include router port vlan id in notifications
Use the port multicast context to check if the router port is a vlan and
in case it is include its vlan id in the notification.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
615cc23e62 net: bridge: multicast: add vlan querier and query support
Add basic vlan context querier support, if the contexts passed to
multicast_alloc_query are vlan then the query will be tagged. Also
handle querier start/stop of vlan contexts.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
4cdd0d10f3 net: bridge: multicast: check if should use vlan mcast ctx
Add helpers which check if the current bridge/port multicast context
should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD
processing, timers and new group addition. It is important for vlans to
disable processing of timer/packet after the multicast_lock is obtained
if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two
cases when that flag is missing:
 - if the vlan is getting destroyed it will be removed and timers will
   be stopped
 - if the vlan mcast snooping is being disabled

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
eb1593a0b4 net: bridge: multicast: use the port group to port context helper
We need to use the new port group to port context helper in places where
we cannot pass down the proper context (i.e. functions that can be
called by timers or outside the packet snooping paths).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
74edfd483d net: bridge: multicast: add helper to get port mcast context from port group
Add br_multicast_pg_to_port_ctx() which returns the proper port multicast
context from either port or vlan based on bridge option and vlan flags.
As the comment inside explains the locking is a bit tricky, we rely on
the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change
and we also require it to be held to call that helper. If we find the
vlan under rcu and it still has the flag then we can be sure it will be
alive until we unlock multicast_lock which should be enough.
Note that the context might change from vlan to bridge between different
calls to this helper as the mcast vlan knob requires only rtnl so it should
be used carefully and for read-only/check purposes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
f4b7002a70 net: bridge: add vlan mcast snooping knob
Add a global knob that controls if vlan multicast snooping is enabled.
The proper contexts (vlan or bridge-wide) will be chosen based on the knob
when processing packets and changing bridge device state. Note that
vlans have their individual mcast snooping enabled by default, but this
knob is needed to turn on bridge vlan snooping. It is disabled by
default. To enable the knob vlan filtering must also be enabled, it
doesn't make sense to have vlan mcast snooping without vlan filtering
since that would lead to inconsistencies. Disabling vlan filtering will
also automatically disable vlan mcast snooping.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
7b54aaaf53 net: bridge: multicast: add vlan state initialization and control
Add helpers to enable/disable vlan multicast based on its flags, we need
two flags because we need to know if the vlan has multicast enabled
globally (user-controlled) and if it has it enabled on the specific device
(bridge or port). The new private vlan flags are:
 - BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used
   when removing a vlan, toggling vlan mcast snooping and controlling
   single vlan (kernel-controlled, valid under RTNL and multicast_lock)
 - BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the
   vlan, used to control the bridge-wide vlan mcast snooping for a
   single vlan (user-controlled, can be checked under any context)

Bridge vlan contexts are created with multicast snooping enabled by
default to be in line with the current bridge snooping defaults. In
order to actually activate per vlan snooping and context usage a
bridge-wide knob will be added later which will default to disabled.
If that knob is enabled then automatically all vlan snooping will be
enabled. All vlan contexts are initialized with the current bridge
multicast context defaults.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
613d61dbef net: bridge: vlan: add global and per-port multicast context
Add global and per-port vlan multicast context, only initialized but
still not used. No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
adc47037a7 net: bridge: multicast: use multicast contexts instead of bridge or port
Pass multicast context pointers to multicast functions instead of bridge/port.
This would make it easier later to switch these contexts to their per-vlan
versions. The patch is basically search and replace, no functional changes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
d3d065c003 net: bridge: multicast: factor out bridge multicast context
Factor out the bridge's global multicast context into a separate
structure which will later be used for per-vlan global context.
No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
9632233e7d net: bridge: multicast: factor out port multicast context
Factor out the port's multicast context into a separate structure which
will later be shared for per-port,vlan context. No functional changes
intended. We need the structure even if bridge multicast is not defined
to pass down as pointer to forwarding functions.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Alexandru Tachici
c45c1e82bb
spi: spi-bcm2835: Fix deadlock
The bcm2835_spi_transfer_one function can create a deadlock
if it is called while another thread already has the
CCF lock.

Signed-off-by: Alexandru Tachici <alexandru.tachici@analog.com>
Fixes: f8043872e7 ("spi: add driver for BCM2835")
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210716210245.13240-2-alexandru.tachici@analog.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-07-20 13:34:05 +01:00
Kurt Kanzenbach
edd2e9d586 Revert "igc: Export LEDs"
This reverts commit cf8331825a.

There are better Linux interfaces to export the different LED modes
and blinking reasons.

Revert this patch for now and come up with better solution later.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20210719101640.16047-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:53:59 +02:00
Jakub Kicinski
97d0931f67 Merge branch 'net-hns3-fixes-for-net'
Guangbin Huang says:

====================
net: hns3: fixes for -net

This series includes some bugfixes for the HNS3 ethernet driver.
====================

Link: https://lore.kernel.org/r/1626685988-25869-1-git-send-email-huangguangbin2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:12:50 +02:00
Jian Shen
bbfd4506f9 net: hns3: fix rx VLAN offload state inconsistent issue
Currently, VF doesn't enable rx VLAN offload when initializating,
and PF does it for VFs. If user disable the rx VLAN offload for
VF with ethtool -K, and reload the VF driver, it may cause the
rx VLAN offload state being inconsistent between hardware and
software.

Fixes it by enabling rx VLAN offload when VF initializing.

Fixes: e2cb1dec97 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:12:50 +02:00
Jian Shen
184cd221a8 net: hns3: disable port VLAN filter when support function level VLAN filter control
For hardware limitation, port VLAN filter is port level, and
effective for all the functions of the port. So if not support
port VLAN bypass, it's necessary to disable the port VLAN filter,
in order to support function level VLAN filter control.

Fixes: 2ba306627f ("net: hns3: add support for modify VLAN filter state")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:12:49 +02:00
Peng Li
4671042f1e net: hns3: add match_id to check mailbox response from PF to VF
When VF need response from PF, VF will wait (1us - 1s) to receive
the response, or it will wait timeout and the VF action fails.
If VF do not receive response in 1st action because timeout,
the 2nd action may receive response for the 1st action, and get
incorrect response data.VF must reciveve the right response from
PF,or it will cause unexpected error.

This patch adds match_id to check mailbox response from PF to VF,
to make sure VF get the right response:
1. The message sent from VF was labelled with match_id which was a
unique 16-bit non-zero value.
2. The response sent from PF will label with match_id which got from
the request.
3. The VF uses the match_id to match request and response message.

This scheme depends on PF driver supports match_id, if PF driver doesn't
support then VF will uses the original scheme.

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:12:49 +02:00
Chengwen Feng
1b713d14dc net: hns3: fix possible mismatches resp of mailbox
Currently, the mailbox synchronous communication between VF and PF use
the following fields to maintain communication:
1. Origin_mbx_msg which was combined by message code and subcode, used
to match request and response.
2. Received_resp which means whether received response.

There may possible mismatches of the following situation:
1. VF sends message A with code=1 subcode=1.
2. PF was blocked about 500ms when processing the message A.
3. VF will detect message A timeout because it can't get the response
within 500ms.
4. VF sends message B with code=1 subcode=1 which equal message A.
5. PF processes the first message A and send the response message to
VF.
6. VF will identify the response matched the message B because the
code/subcode is the same. This will lead to mismatch of request and
response.

To fix the above bug, we use the following scheme:
1. The message sent from VF was labelled with match_id which was a
unique 16-bit non-zero value.
2. The response sent from PF will label with match_id which got from
the request.
3. The VF uses the match_id to match request and response message.

As for PF driver, it only needs to copy the match_id from request to
response.

Fixes: dde1a86e93 ("net: hns3: Add mailbox support to PF driver")
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:12:48 +02:00
Vladimir Oltean
cbb56b03ec net: bridge: do not replay fdb entries pointing towards the bridge twice
This simple script:

ip link add br0 type bridge
ip link set swp2 master br0
ip link set br0 address 00:01:02:03:04:05
ip link del br0

produces this result on a DSA switch:

[  421.306399] br0: port 1(swp2) entered blocking state
[  421.311445] br0: port 1(swp2) entered disabled state
[  421.472553] device swp2 entered promiscuous mode
[  421.488986] device swp2 left promiscuous mode
[  421.493508] br0: port 1(swp2) entered disabled state
[  421.886107] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 1 from fdb: -ENOENT
[  421.894374] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 0 from fdb: -ENOENT
[  421.943982] br0: port 1(swp2) entered blocking state
[  421.949030] br0: port 1(swp2) entered disabled state
[  422.112504] device swp2 entered promiscuous mode

A very simplified view of what happens is:

(1) the bridge port is created, and the bridge device inherits its MAC
    address

(2) when joining, the bridge port (DSA) requests a replay of the
    addition of all FDB entries towards this bridge port and towards the
    bridge device itself. In fact, DSA calls br_fdb_replay() twice:

	br_fdb_replay(br, brport_dev);
	br_fdb_replay(br, br);

    DSA uses reference counting for the FDB entries. So the MAC address
    of the bridge is simply kept with refcount 2. When the bridge port
    leaves under normal circumstances, everything cancels out since the
    replay of the FDB entry deletion is also done twice per VLAN.

(3) when the bridge MAC address changes, switchdev is notified of the
    deletion of the old address and of the insertion of the new one.
    But the old address does not really go away, since it had refcount
    2, and the new address is added "only" with refcount 1.

(4) when the bridge port leaves now, it will replay a deletion of the
    FDB entries pointing towards the bridge twice. Then DSA will
    complain that it can't delete something that no longer exists.

It is clear that the problem is that the FDB entries towards the bridge
are replayed too many times, so let's fix that problem.

Fixes: 63c51453c8 ("net: dsa: replay the local bridge FDB entries pointing to the bridge dev too")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210719093916.4099032-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 13:08:45 +02:00
Landen Chao
6c2d125823 net: Update MAINTAINERS for MediaTek switch driver
Update maintainers for MediaTek switch driver with Deng Qingfang who has
contributed many high-quality patches (interrupt, VLAN, GPIO, and etc.)
and will help maintenance.

Signed-off-by: Landen Chao <landen.chao@mediatek.com>
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/49e1aa8aac58dcbf1b5e036d09b3fa3bbb1d94d0.1626751861.git.landen.chao@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 12:24:02 +02:00
Eric Dumazet
e93abb840a net/tcp_fastopen: remove tcp_fastopen_ctx_lock
Remove the (per netns) spinlock in favor of xchg() atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Link: https://lore.kernel.org/r/20210719101107.3203943-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 12:07:07 +02:00
Eric Dumazet
749468760b net/tcp_fastopen: remove obsolete extern
After cited commit, sysctl_tcp_fastopen_blackhole_timeout is no longer
a global variable.

Fixes: 3733be14a3 ("ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Link: https://lore.kernel.org/r/20210719092028.3016745-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 12:06:33 +02:00
Vasily Averin
2d85a1b31d ipv6: ip6_finish_output2: set sk into newly allocated nskb
skb_set_owner_w() should set sk not to old skb but to new nskb.

Fixes: 5796015fa9 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/70c0744f-89ae-1869-7e3e-4fa292158f4b@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 11:52:36 +02:00
Yajun Deng
fef773fc81 netlink: Deal with ESRCH error in nlmsg_notify()
Yonghong Song report:
The bpf selftest tc_bpf failed with latest bpf-next.
The following is the command to run and the result:
$ ./test_progs -n 132
[   40.947571] bpf_testmod: loading out-of-tree module taints kernel.
test_tc_bpf:PASS:test_tc_bpf__open_and_load 0 nsec
test_tc_bpf:PASS:bpf_tc_hook_create(BPF_TC_INGRESS) 0 nsec
test_tc_bpf:PASS:bpf_tc_hook_create invalid hook.attach_point 0 nsec
test_tc_bpf_basic:PASS:bpf_obj_get_info_by_fd 0 nsec
test_tc_bpf_basic:PASS:bpf_tc_attach 0 nsec
test_tc_bpf_basic:PASS:handle set 0 nsec
test_tc_bpf_basic:PASS:priority set 0 nsec
test_tc_bpf_basic:PASS:prog_id set 0 nsec
test_tc_bpf_basic:PASS:bpf_tc_attach replace mode 0 nsec
test_tc_bpf_basic:PASS:bpf_tc_query 0 nsec
test_tc_bpf_basic:PASS:handle set 0 nsec
test_tc_bpf_basic:PASS:priority set 0 nsec
test_tc_bpf_basic:PASS:prog_id set 0 nsec
libbpf: Kernel error message: Failed to send filter delete notification
test_tc_bpf_basic:FAIL:bpf_tc_detach unexpected error: -3 (errno 3)
test_tc_bpf:FAIL:test_tc_internal ingress unexpected error: -3 (errno 3)

The failure seems due to the commit
    cfdf0d9ae7 ("rtnetlink: use nlmsg_notify() in rtnetlink_send()")

Deal with ESRCH error in nlmsg_notify() even the report variable is zero.

Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Link: https://lore.kernel.org/r/20210719051816.11762-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20 11:45:09 +02:00
Eric Sandeen
8cae8cd89f seq_file: disallow extremely large seq buffer allocations
There is no reasonable need for a buffer larger than this, and it avoids
int overflow pitfalls.

Fixes: 058504edd0 ("fs/seq_file: fallback to vmalloc allocation")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-19 17:18:48 -07:00
Subbaraya Sundeep
23109f8dd0 octeontx2-af: Introduce internal packet switching
As of now any communication between CGXs PFs and
their VFs within the system is possible only by
external switches sending packets back to the
system. This patch adds internal switching support.
Broadcast packet replication is not covered here.
RVU admin function (AF) maintains MAC addresses
of all interfaces in the system. When switching is
enabled, MCAM entries are allocated to install rules
such that packets with DMAC matching any of the
internal interface MAC addresses is punted back
into the system via the loopback channel.
On the receive side the default unicast rules
are modified to not check for ingress channel.
So any packet with matching DMAC irrespective of
which interface it is coming from will be forwarded
to the respective PF/VF interface.
The transmit side rules and default unicast rules
are updated if user changes MAC address of an interface.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:24:25 -07:00
Subbaraya Sundeep
cb7a6b3bac octeontx2-af: Prepare for allocating MCAM rules for AF
AF till now only manages the allocation and freeing of
MCAM rules for other PF/VFs in system. To implement
L2 switching between all CGX mapped PF and VFs, AF
requires MCAM entries for DMAC rules for each PF and VF.
This patch modifies AF driver such that AF can also
allocate MCAM rules and install rules for other
PFs and VFs. All the checks like channel verification
for RX rules and PF_FUNC verification for TX rules are
relaxed in case AF is allocating or installing rules.
Also all the entry and counter to owner mappings are
set to NPC_MCAM_INVALID_MAP when they are free indicating
those are not allocated to AF nor PF/VFs.
This patch also ensures that AF allocated and installed
entries are displayed in debugfs.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:24:25 -07:00
Subbaraya Sundeep
fa2bf6baf2 octeontx2-af: Enable transmit side LBK link
For enabling VF-VF switching the packets egressing
out of CGX mapped VFs needed to be sent to LBK
so that same packets are received back to the system.
But the LBK link also needs to be enabled in addition
to a VF's mapped CGX_LMAC link otherwise hardware
raises send error interrupt indicating selected LBK
link is not enabled in NIX_AF_TL3_TL2X_LINKX_CFG register.
Hence this patch enables all LBK links in
TL3_TL2_LINKX_CFG registers.
Also to enable packet flow between PFs/VFs of NIX0
to PFs/VFs of NIX1(in 98xx silicon) the NPC TX DMAC
rules has to be installed such that rules must be hit
for any TX interface i.e., NIX0-TX or NIX1-TX provided
DMAC match creteria is met. Hence this patch changes the
behavior such that MCAM is programmed to match with any
NIX0/1-TX interface for TX rules.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:24:24 -07:00
Eric Dumazet
6f20c8adb1 net/tcp_fastopen: fix data races around tfo_active_disable_stamp
tfo_active_disable_stamp is read and written locklessly.
We need to annotate these accesses appropriately.

Then, we need to perform the atomic_inc(tfo_active_disable_times)
after the timestamp has been updated, and thus add barriers
to make sure tcp_fastopen_active_should_disable() wont read
a stale timestamp.

Fixes: cf1ef3f071 ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:11:51 -07:00
David S. Miller
a0050653db Merge branch 'dt-bindinga-dwmac'
Joakim Zhang says:

====================
dt-bindings: net: dwmac-imx: convert

This patch set intends to convert imx dwmac binding to schema, and fixes
found by dt_binding_check and dtbs_check.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:08:06 -07:00
Joakim Zhang
77e5253dea arm64: dts: imx8mp: change interrupt order per dt-binding
This patch changs interrupt order which found by dtbs_check.

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
arch/arm64/boot/dts/freescale/imx8mp-evk.dt.yaml: ethernet@30bf0000: interrupt-names:0: 'macirq' was expected
arch/arm64/boot/dts/freescale/imx8mp-evk.dt.yaml: ethernet@30bf0000: interrupt-names:1: 'eth_wake_irq' was expected

According to Documentation/devicetree/bindings/net/snps,dwmac.yaml, we
should list interrupt in it's order.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:07:01 -07:00
Joakim Zhang
e314a07ef2 dt-bindings: net: imx-dwmac: convert imx-dwmac bindings to yaml
In order to automate the verification of DT nodes covert imx-dwmac to
nxp,dwmac-imx.yaml, and pass below checking.

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:07:01 -07:00
Joakim Zhang
bdad810eb9 dt-bindings: net: snps,dwmac: add missing DWMAC IP version
Add missing DWMAC IP version in snps,dwmac.yaml which found by below
command, as NXP i.MX8 families support SNPS DWMAC 5.10a IP.

$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
Documentation/devicetree/bindings/net/nxp,dwmac-imx.example.dt.yaml:
ethernet@30bf0000: compatible: None of ['nxp,imx8mp-dwmac-eqos', 'snps,dwmac-5.10a'] are valid under the given schema

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:07:01 -07:00
Randy Dunlap
b16f3299ae net: hisilicon: rename CACHE_LINE_MASK to avoid redefinition
Building on ARCH=arc causes a "redefined" warning, so rename this
driver's CACHE_LINE_MASK to avoid the warning.

../drivers/net/ethernet/hisilicon/hip04_eth.c:134: warning: "CACHE_LINE_MASK" redefined
  134 | #define CACHE_LINE_MASK   0x3F
In file included from ../include/linux/cache.h:6,
                 from ../include/linux/printk.h:9,
                 from ../include/linux/kernel.h:19,
                 from ../include/linux/list.h:9,
                 from ../include/linux/module.h:12,
                 from ../drivers/net/ethernet/hisilicon/hip04_eth.c:7:
../arch/arc/include/asm/cache.h:17: note: this is the location of the previous definition
   17 | #define CACHE_LINE_MASK  (~(L1_CACHE_BYTES - 1))

Fixes: d413779cdd ("net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 10:02:12 -07:00
Stefan Assmann
226d528512 iavf: fix locking of critical sections
To avoid races between iavf_init_task(), iavf_reset_task(),
iavf_watchdog_task(), iavf_adminq_task() as well as the shutdown and
remove functions more locking is required.
The current protection by __IAVF_IN_CRITICAL_TASK is needed in
additional places.

- The reset task performs state transitions, therefore needs locking.
- The adminq task acts on replies from the PF in
  iavf_virtchnl_completion() which may alter the states.
- The init task is not only run during probe but also if a VF gets stuck
  to reinitialize it.
- The shutdown function performs a state transition.
- The remove function performs a state transition and also free's
  resources.

iavf_lock_timeout() is introduced to avoid waiting infinitely
and cause a deadlock. Rather unlock and print a warning.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-19 09:18:30 -07:00
Stefan Assmann
22c8fd71d3 iavf: do not override the adapter state in the watchdog task
The iavf watchdog task overrides adapter->state to __IAVF_RESETTING
when it detects a pending reset. Then schedules iavf_reset_task() which
takes care of the reset.

The reset task is capable of handling the reset without changing
adapter->state. In fact we lose the state information when the watchdog
task prematurely changes the adapter state. This may lead to a crash if
instead of the reset task the iavf_remove() function gets called before
the reset task.
In that case (if we were in state __IAVF_RUNNING previously) the
iavf_remove() function triggers iavf_close() which fails to close the
device because of the incorrect state information.

This may result in a crash due to pending interrupts.
kernel BUG at drivers/pci/msi.c:357!
[...]
Call Trace:
 [<ffffffffbddf24dd>] pci_disable_msix+0x3d/0x50
 [<ffffffffc08d2a63>] iavf_reset_interrupt_capability+0x23/0x40 [iavf]
 [<ffffffffc08d312a>] iavf_remove+0x10a/0x350 [iavf]
 [<ffffffffbddd3359>] pci_device_remove+0x39/0xc0
 [<ffffffffbdeb492f>] __device_release_driver+0x7f/0xf0
 [<ffffffffbdeb49c3>] device_release_driver+0x23/0x30
 [<ffffffffbddcabb4>] pci_stop_bus_device+0x84/0xa0
 [<ffffffffbddcacc2>] pci_stop_and_remove_bus_device+0x12/0x20
 [<ffffffffbddf361f>] pci_iov_remove_virtfn+0xaf/0x160
 [<ffffffffbddf3bcc>] sriov_disable+0x3c/0xf0
 [<ffffffffbddf3ca3>] pci_disable_sriov+0x23/0x30
 [<ffffffffc0667365>] i40e_free_vfs+0x265/0x2d0 [i40e]
 [<ffffffffc0667624>] i40e_pci_sriov_configure+0x144/0x1f0 [i40e]
 [<ffffffffbddd5307>] sriov_numvfs_store+0x177/0x1d0
Code: 00 00 e8 3c 25 e3 ff 49 c7 86 88 08 00 00 00 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 8b 7b 28 e8 0d 44
RIP  [<ffffffffbbbf1068>] free_msi_irqs+0x188/0x190

The solution is to not touch the adapter->state in iavf_watchdog_task()
and let the reset task handle the state transition.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-19 09:18:30 -07:00
Stefan Assmann
8b4b06919f i40e: improve locking of mac_filter_hash
i40e_config_vf_promiscuous_mode() calls
i40e_getnum_vf_vsi_vlan_filters() without acquiring the
mac_filter_hash_lock spinlock.

This is unsafe because mac_filter_hash may get altered in another thread
while i40e_getnum_vf_vsi_vlan_filters() traverses the hashes.

Simply adding the spinlock in i40e_getnum_vf_vsi_vlan_filters() is not
possible as it already gets called in i40e_get_vlan_list_sync() with the
spinlock held. Therefore adding a wrapper that acquires the spinlock and
call the correct function where appropriate.

Fixes: 37d318d780 ("i40e: Remove scheduling while atomic possibility")
Fix-suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-19 09:18:30 -07:00
David S. Miller
1dd271d9e5 Merge branch 'bnxt_en-fixes'
Michael Chan says:

====================
bnxt_en: Bug fixes

Most of the fixes in this series have to do with error recovery.  They
include error path handling when the error recovery has to abort, and
the rediscovery of capabilities (PTP and RoCE) after firmware reset
that may result in capability changes.

Two other fixes are to reject invalid ETS settings and to validate
VLAN protocol in the RX path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 08:50:35 -07:00