2019-05-27 14:55:01 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
2016-06-05 03:17:07 +08:00
|
|
|
/*
|
|
|
|
* net/dsa/dsa2.c - Hardware switch handling, binding version 2
|
|
|
|
* Copyright (c) 2008-2009 Marvell Semiconductor
|
|
|
|
* Copyright (c) 2013 Florian Fainelli <florian@openwrt.org>
|
|
|
|
* Copyright (c) 2016 Andrew Lunn <andrew@lunn.ch>
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/device.h>
|
|
|
|
#include <linux/err.h>
|
|
|
|
#include <linux/list.h>
|
2017-03-29 05:45:06 +08:00
|
|
|
#include <linux/netdevice.h>
|
2016-06-05 03:17:07 +08:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/rtnetlink.h>
|
|
|
|
#include <linux/of.h>
|
|
|
|
#include <linux/of_net.h>
|
2019-03-24 18:14:26 +08:00
|
|
|
#include <net/devlink.h>
|
net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
In order for switch driver to be able to make simple and reliable use of
the master tracking operations, they must also be notified of the
initial state of the DSA master, not just of the changes. This is
because they might enable certain features only during the time when
they know that the DSA master is up and running.
Therefore, this change explicitly checks the state of the DSA master
under the same rtnl_mutex as we were holding during the
dsa_master_setup() and dsa_master_teardown() call. The idea being that
if the DSA master became operational in between the moment in which it
became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
when we checked for the master being up, there is a chance that we
would emit a ->master_state_change() call with no actual state change.
We need to avoid that by serializing the concurrent netdevice event with
us. If the netdevice event started before, we force it to finish before
we begin, because we take rtnl_lock before making netdev_uses_dsa()
return true. So we also handle that early event and do nothing on it.
Similarly, if the dev_open() attempt is concurrent with us, it will
attempt to take the rtnl_mutex, but we're holding it. We'll see that
the master flag IFF_UP isn't set, then when we release the rtnl_mutex
we'll process the NETDEV_UP notifier.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02 08:03:21 +08:00
|
|
|
#include <net/sch_generic.h>
|
2017-05-18 03:46:03 +08:00
|
|
|
|
2016-06-05 03:17:07 +08:00
|
|
|
#include "dsa_priv.h"
|
|
|
|
|
|
|
|
static DEFINE_MUTEX(dsa2_mutex);
|
net: dsa: implement auto-normalization of MTU for bridge hardware datapath
Many switches don't have an explicit knob for configuring the MTU
(maximum transmission unit per interface). Instead, they do the
length-based packet admission checks on the ingress interface, for
reasons that are easy to understand (why would you accept a packet in
the queuing subsystem if you know you're going to drop it anyway).
So it is actually the MRU that these switches permit configuring.
In Linux there only exists the IFLA_MTU netlink attribute and the
associated dev_set_mtu function. The comments like to play blind and say
that it's changing the "maximum transfer unit", which is to say that
there isn't any directionality in the meaning of the MTU word. So that
is the interpretation that this patch is giving to things: MTU == MRU.
When 2 interfaces having different MTUs are bridged, the bridge driver
MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
adjusts the MTU of the bridge net device itself (and not that of the
slave net devices) to the minimum value of all slave interfaces, in
order for forwarded packets to not exceed the MTU regardless of the
interface they are received and send on.
The idea behind this behavior, and why the slave MTUs are not adjusted,
is that normal termination from Linux over the L2 forwarding domain
should happen over the bridge net device, which _is_ properly limited by
the minimum MTU. And termination over individual slave devices is
possible even if those are bridged. But that is not "forwarding", so
there's no reason to do normalization there, since only a single
interface sees that packet.
The problem with those switches that can only control the MRU is with
the offloaded data path, where a packet received on an interface with
MRU 9000 would still be forwarded to an interface with MRU 1500. And the
br_mtu_auto_adjust() function does not really help, since the MTU
configured on the bridge net device is ignored.
In order to enforce the de-facto MTU == MRU rule for these switches, we
need to do MTU normalization, which means: in order for no packet larger
than the MTU configured on this port to be sent, then we need to limit
the MRU on all ports that this packet could possibly come from. AKA
since we are configuring the MRU via MTU, it means that all ports within
a bridge forwarding domain should have the same MTU.
And that is exactly what this patch is trying to do.
>From an implementation perspective, we try to follow the intent of the
user, otherwise there is a risk that we might livelock them (they try to
change the MTU on an already-bridged interface, but we just keep
changing it back in an attempt to keep the MTU normalized). So the MTU
that the bridge is normalized to is either:
- The most recently changed one:
ip link set dev swp0 master br0
ip link set dev swp1 master br0
ip link set dev swp0 mtu 1400
This sequence will make swp1 inherit MTU 1400 from swp0.
- The one of the most recently added interface to the bridge:
ip link set dev swp0 master br0
ip link set dev swp1 mtu 1400
ip link set dev swp1 master br0
The above sequence will make swp0 inherit MTU 1400 as well.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-28 03:55:43 +08:00
|
|
|
LIST_HEAD(dsa_tree_list);
|
2016-06-05 03:17:07 +08:00
|
|
|
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
/* Track the bridges with forwarding offload enabled */
|
|
|
|
static unsigned long dsa_fwd_offloading_bridges;
|
|
|
|
|
2021-01-29 09:00:04 +08:00
|
|
|
/**
|
|
|
|
* dsa_tree_notify - Execute code for all switches in a DSA switch tree.
|
|
|
|
* @dst: collection of struct dsa_switch devices to notify.
|
|
|
|
* @e: event, must be of type DSA_NOTIFIER_*
|
|
|
|
* @v: event-specific value.
|
|
|
|
*
|
|
|
|
* Given a struct dsa_switch_tree, this can be used to run a function once for
|
|
|
|
* each member DSA switch. The other alternative of traversing the tree is only
|
|
|
|
* through its ports list, which does not uniquely list the switches.
|
|
|
|
*/
|
|
|
|
int dsa_tree_notify(struct dsa_switch_tree *dst, unsigned long e, void *v)
|
|
|
|
{
|
|
|
|
struct raw_notifier_head *nh = &dst->nh;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = raw_notifier_call_chain(nh, e, v);
|
|
|
|
|
|
|
|
return notifier_to_errno(err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* dsa_broadcast - Notify all DSA trees in the system.
|
|
|
|
* @e: event, must be of type DSA_NOTIFIER_*
|
|
|
|
* @v: event-specific value.
|
|
|
|
*
|
|
|
|
* Can be used to notify the switching fabric of events such as cross-chip
|
|
|
|
* bridging between disjoint trees (such as islands of tagger-compatible
|
|
|
|
* switches bridged by an incompatible middle switch).
|
net: dsa: tag_8021q: don't broadcast during setup/teardown
Currently, on my board with multiple sja1105 switches in disjoint trees
described in commit f66a6a69f97a ("net: dsa: permit cross-chip bridging
between all trees in the system"), rebooting the board triggers the
following benign warnings:
[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT
Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
RX VLANs cannot be found on switch 2's CPU port.
But why would switch 2 even attempt to delete switch 1's TX and RX
tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
and it is supposed that it had added those VLANs in the first place
(because in dsa_port_tag_8021q_vlan_match, all CPU ports match
regardless of their tree index or switch index).
The two trees probe asynchronously, and when switch 1 probed, it called
dsa_broadcast which did not notify the tree of switch 2, because that
didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
_is_ notified of the deletion.
Before jumping to introduce a synchronization mechanism between the
probing across disjoint switch trees, let's take a step back and see
whether we _need_ to do that in the first place.
The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
only if switch 1 and 2 were part of a cross-chip bridge. And
dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
was synchronous, the bridge_join would just end up bumping the VLANs'
refcount, because they are already installed by the setup path).
Since by the time the ports are bridged, all DSA trees are already set
up, and we don't need the tag_8021q VLANs of one switch installed on the
other switches during probe time, the answer is that we don't need to
fix the synchronization issue.
So make the setup and teardown code paths call dsa_port_notify, which
notifies only the local tree, and the bridge code paths call
dsa_broadcast, which let the other trees know as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 21:46:06 +08:00
|
|
|
*
|
|
|
|
* WARNING: this function is not reliable during probe time, because probing
|
|
|
|
* between trees is asynchronous and not all DSA trees might have probed.
|
2021-01-29 09:00:04 +08:00
|
|
|
*/
|
|
|
|
int dsa_broadcast(unsigned long e, void *v)
|
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
list_for_each_entry(dst, &dsa_tree_list, list) {
|
|
|
|
err = dsa_tree_notify(dst, e, v);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2021-01-13 16:42:53 +08:00
|
|
|
/**
|
|
|
|
* dsa_lag_map() - Map LAG netdev to a linear LAG ID
|
|
|
|
* @dst: Tree in which to record the mapping.
|
|
|
|
* @lag: Netdev that is to be mapped to an ID.
|
|
|
|
*
|
|
|
|
* dsa_lag_id/dsa_lag_dev can then be used to translate between the
|
|
|
|
* two spaces. The size of the mapping space is determined by the
|
|
|
|
* driver by setting ds->num_lag_ids. It is perfectly legal to leave
|
|
|
|
* it unset if it is not needed, in which case these functions become
|
|
|
|
* no-ops.
|
|
|
|
*/
|
|
|
|
void dsa_lag_map(struct dsa_switch_tree *dst, struct net_device *lag)
|
|
|
|
{
|
|
|
|
unsigned int id;
|
|
|
|
|
|
|
|
if (dsa_lag_id(dst, lag) >= 0)
|
|
|
|
/* Already mapped */
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (id = 0; id < dst->lags_len; id++) {
|
|
|
|
if (!dsa_lag_dev(dst, id)) {
|
|
|
|
dst->lags[id] = lag;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* No IDs left, which is OK. Some drivers do not need it. The
|
|
|
|
* ones that do, e.g. mv88e6xxx, will discover that dsa_lag_id
|
|
|
|
* returns an error for this device when joining the LAG. The
|
|
|
|
* driver can then return -EOPNOTSUPP back to DSA, which will
|
|
|
|
* fall back to a software LAG.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* dsa_lag_unmap() - Remove a LAG ID mapping
|
|
|
|
* @dst: Tree in which the mapping is recorded.
|
|
|
|
* @lag: Netdev that was mapped.
|
|
|
|
*
|
|
|
|
* As there may be multiple users of the mapping, it is only removed
|
|
|
|
* if there are no other references to it.
|
|
|
|
*/
|
|
|
|
void dsa_lag_unmap(struct dsa_switch_tree *dst, struct net_device *lag)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
unsigned int id;
|
|
|
|
|
|
|
|
dsa_lag_foreach_port(dp, dst, lag)
|
|
|
|
/* There are remaining users of this mapping */
|
|
|
|
return;
|
|
|
|
|
|
|
|
dsa_lags_foreach_id(id, dst) {
|
|
|
|
if (dsa_lag_dev(dst, id) == lag) {
|
|
|
|
dst->lags[id] = NULL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 00:57:56 +08:00
|
|
|
struct dsa_bridge *dsa_tree_bridge_find(struct dsa_switch_tree *dst,
|
|
|
|
const struct net_device *br)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dsa_port_bridge_dev_get(dp) == br)
|
|
|
|
return dp->bridge;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
static int dsa_bridge_num_find(const struct net_device *bridge_dev)
|
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
|
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 00:57:56 +08:00
|
|
|
list_for_each_entry(dst, &dsa_tree_list, list) {
|
|
|
|
struct dsa_bridge *bridge;
|
|
|
|
|
|
|
|
bridge = dsa_tree_bridge_find(dst, bridge_dev);
|
|
|
|
if (bridge)
|
|
|
|
return bridge->num;
|
|
|
|
}
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
|
2021-12-07 00:57:47 +08:00
|
|
|
return 0;
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
}
|
|
|
|
|
2021-12-07 00:57:47 +08:00
|
|
|
unsigned int dsa_bridge_num_get(const struct net_device *bridge_dev, int max)
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
{
|
2021-12-07 00:57:47 +08:00
|
|
|
unsigned int bridge_num = dsa_bridge_num_find(bridge_dev);
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
|
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 00:57:56 +08:00
|
|
|
/* Switches without FDB isolation support don't get unique
|
|
|
|
* bridge numbering
|
|
|
|
*/
|
|
|
|
if (!max)
|
|
|
|
return 0;
|
|
|
|
|
2021-12-07 00:57:47 +08:00
|
|
|
if (!bridge_num) {
|
2021-12-07 00:57:48 +08:00
|
|
|
/* First port that requests FDB isolation or TX forwarding
|
|
|
|
* offload for this bridge
|
|
|
|
*/
|
2021-12-07 00:57:47 +08:00
|
|
|
bridge_num = find_next_zero_bit(&dsa_fwd_offloading_bridges,
|
|
|
|
DSA_MAX_NUM_OFFLOADING_BRIDGES,
|
|
|
|
1);
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
if (bridge_num >= max)
|
2021-12-07 00:57:47 +08:00
|
|
|
return 0;
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
|
|
|
|
set_bit(bridge_num, &dsa_fwd_offloading_bridges);
|
|
|
|
}
|
|
|
|
|
|
|
|
return bridge_num;
|
|
|
|
}
|
|
|
|
|
2021-12-07 00:57:47 +08:00
|
|
|
void dsa_bridge_num_put(const struct net_device *bridge_dev,
|
|
|
|
unsigned int bridge_num)
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
{
|
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 00:57:56 +08:00
|
|
|
/* Since we refcount bridges, we know that when we call this function
|
|
|
|
* it is no longer in use, so we can just go ahead and remove it from
|
|
|
|
* the bit mask.
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
*/
|
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 00:57:56 +08:00
|
|
|
clear_bit(bridge_num, &dsa_fwd_offloading_bridges);
|
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 01:55:00 +08:00
|
|
|
}
|
|
|
|
|
2020-05-11 00:37:42 +08:00
|
|
|
struct dsa_switch *dsa_switch_find(int tree_index, int sw_index)
|
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dst, &dsa_tree_list, list) {
|
|
|
|
if (dst->index != tree_index)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
if (dp->ds->index != sw_index)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return dp->ds;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dsa_switch_find);
|
|
|
|
|
2017-11-04 07:05:24 +08:00
|
|
|
static struct dsa_switch_tree *dsa_tree_find(int index)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
|
2017-11-04 07:05:24 +08:00
|
|
|
list_for_each_entry(dst, &dsa_tree_list, list)
|
2017-11-04 07:05:22 +08:00
|
|
|
if (dst->index == index)
|
2016-06-05 03:17:07 +08:00
|
|
|
return dst;
|
2017-11-04 07:05:22 +08:00
|
|
|
|
2016-06-05 03:17:07 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:24 +08:00
|
|
|
static struct dsa_switch_tree *dsa_tree_alloc(int index)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
|
|
|
|
dst = kzalloc(sizeof(*dst), GFP_KERNEL);
|
|
|
|
if (!dst)
|
|
|
|
return NULL;
|
2017-11-04 07:05:24 +08:00
|
|
|
|
2017-11-04 07:05:21 +08:00
|
|
|
dst->index = index;
|
2017-11-04 07:05:24 +08:00
|
|
|
|
2019-10-31 10:09:13 +08:00
|
|
|
INIT_LIST_HEAD(&dst->rtable);
|
|
|
|
|
2019-10-22 04:51:16 +08:00
|
|
|
INIT_LIST_HEAD(&dst->ports);
|
|
|
|
|
2016-06-05 03:17:07 +08:00
|
|
|
INIT_LIST_HEAD(&dst->list);
|
2019-10-19 05:02:46 +08:00
|
|
|
list_add_tail(&dst->list, &dsa_tree_list);
|
2017-11-04 07:05:22 +08:00
|
|
|
|
2016-06-05 03:17:07 +08:00
|
|
|
kref_init(&dst->refcount);
|
|
|
|
|
|
|
|
return dst;
|
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:23 +08:00
|
|
|
static void dsa_tree_free(struct dsa_switch_tree *dst)
|
|
|
|
{
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
if (dst->tag_ops)
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
dsa_tag_driver_put(dst->tag_ops);
|
2017-11-04 07:05:23 +08:00
|
|
|
list_del(&dst->list);
|
|
|
|
kfree(dst);
|
|
|
|
}
|
|
|
|
|
2017-11-25 00:36:06 +08:00
|
|
|
static struct dsa_switch_tree *dsa_tree_get(struct dsa_switch_tree *dst)
|
2017-11-04 07:05:24 +08:00
|
|
|
{
|
2017-11-25 00:36:06 +08:00
|
|
|
if (dst)
|
|
|
|
kref_get(&dst->refcount);
|
2017-11-04 07:05:24 +08:00
|
|
|
|
|
|
|
return dst;
|
|
|
|
}
|
|
|
|
|
2017-11-25 00:36:06 +08:00
|
|
|
static struct dsa_switch_tree *dsa_tree_touch(int index)
|
2017-11-04 07:05:23 +08:00
|
|
|
{
|
2017-11-25 00:36:06 +08:00
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
|
|
|
|
dst = dsa_tree_find(index);
|
|
|
|
if (dst)
|
|
|
|
return dsa_tree_get(dst);
|
|
|
|
else
|
|
|
|
return dsa_tree_alloc(index);
|
2017-11-04 07:05:23 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_tree_release(struct kref *ref)
|
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
|
|
|
|
dst = container_of(ref, struct dsa_switch_tree, refcount);
|
|
|
|
|
|
|
|
dsa_tree_free(dst);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_tree_put(struct dsa_switch_tree *dst)
|
|
|
|
{
|
2017-11-25 00:36:06 +08:00
|
|
|
if (dst)
|
|
|
|
kref_put(&dst->refcount, dsa_tree_release);
|
2017-11-04 07:05:23 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:49 +08:00
|
|
|
static struct dsa_port *dsa_tree_find_port_by_node(struct dsa_switch_tree *dst,
|
|
|
|
struct device_node *dn)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2017-11-07 05:11:49 +08:00
|
|
|
struct dsa_port *dp;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-22 04:51:21 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dp->dn == dn)
|
|
|
|
return dp;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2019-12-17 19:20:38 +08:00
|
|
|
static struct dsa_link *dsa_link_touch(struct dsa_port *dp,
|
|
|
|
struct dsa_port *link_dp)
|
2019-10-31 10:09:13 +08:00
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dp->ds;
|
|
|
|
struct dsa_switch_tree *dst;
|
|
|
|
struct dsa_link *dl;
|
|
|
|
|
|
|
|
dst = ds->dst;
|
|
|
|
|
|
|
|
list_for_each_entry(dl, &dst->rtable, list)
|
|
|
|
if (dl->dp == dp && dl->link_dp == link_dp)
|
|
|
|
return dl;
|
|
|
|
|
|
|
|
dl = kzalloc(sizeof(*dl), GFP_KERNEL);
|
|
|
|
if (!dl)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
dl->dp = dp;
|
|
|
|
dl->link_dp = link_dp;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dl->list);
|
|
|
|
list_add_tail(&dl->list, &dst->rtable);
|
|
|
|
|
|
|
|
return dl;
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:51 +08:00
|
|
|
static bool dsa_port_setup_routing_table(struct dsa_port *dp)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2017-11-07 05:11:51 +08:00
|
|
|
struct dsa_switch *ds = dp->ds;
|
|
|
|
struct dsa_switch_tree *dst = ds->dst;
|
|
|
|
struct device_node *dn = dp->dn;
|
2017-11-07 05:11:50 +08:00
|
|
|
struct of_phandle_iterator it;
|
2017-11-07 05:11:49 +08:00
|
|
|
struct dsa_port *link_dp;
|
2019-10-31 10:09:13 +08:00
|
|
|
struct dsa_link *dl;
|
2017-11-07 05:11:50 +08:00
|
|
|
int err;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2017-11-07 05:11:50 +08:00
|
|
|
of_for_each_phandle(&it, err, dn, "link", NULL, 0) {
|
|
|
|
link_dp = dsa_tree_find_port_by_node(dst, it.node);
|
|
|
|
if (!link_dp) {
|
|
|
|
of_node_put(it.node);
|
2017-11-07 05:11:51 +08:00
|
|
|
return false;
|
2017-11-07 05:11:50 +08:00
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-31 10:09:13 +08:00
|
|
|
dl = dsa_link_touch(dp, link_dp);
|
|
|
|
if (!dl) {
|
|
|
|
of_node_put(it.node);
|
|
|
|
return false;
|
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:51 +08:00
|
|
|
return true;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2019-10-31 10:09:15 +08:00
|
|
|
static bool dsa_tree_setup_routing_table(struct dsa_switch_tree *dst)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2017-11-07 05:11:51 +08:00
|
|
|
bool complete = true;
|
|
|
|
struct dsa_port *dp;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-22 04:51:20 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
2019-10-31 10:09:15 +08:00
|
|
|
if (dsa_port_is_dsa(dp)) {
|
2017-11-07 05:11:51 +08:00
|
|
|
complete = dsa_port_setup_routing_table(dp);
|
|
|
|
if (!complete)
|
|
|
|
break;
|
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:51 +08:00
|
|
|
return complete;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:44 +08:00
|
|
|
static struct dsa_port *dsa_tree_find_first_cpu(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
2019-10-22 04:51:23 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dsa_port_is_cpu(dp))
|
|
|
|
return dp;
|
2017-11-07 05:11:44 +08:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
/* Assign the default CPU port (the first one in the tree) to all ports of the
|
|
|
|
* fabric which don't already have one as part of their own switch.
|
|
|
|
*/
|
2017-11-07 05:11:44 +08:00
|
|
|
static int dsa_tree_setup_default_cpu(struct dsa_switch_tree *dst)
|
|
|
|
{
|
2019-10-22 04:51:24 +08:00
|
|
|
struct dsa_port *cpu_dp, *dp;
|
2017-11-07 05:11:44 +08:00
|
|
|
|
2019-10-22 04:51:24 +08:00
|
|
|
cpu_dp = dsa_tree_find_first_cpu(dst);
|
|
|
|
if (!cpu_dp) {
|
|
|
|
pr_err("DSA: tree %d has no CPU port\n", dst->index);
|
2017-11-07 05:11:44 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
if (dp->cpu_dp)
|
|
|
|
continue;
|
|
|
|
|
2019-10-22 04:51:24 +08:00
|
|
|
if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp))
|
|
|
|
dp->cpu_dp = cpu_dp;
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
}
|
2017-11-07 05:11:44 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
/* Perform initial assignment of CPU ports to user ports and DSA links in the
|
|
|
|
* fabric, giving preference to CPU ports local to each switch. Default to
|
|
|
|
* using the first CPU port in the switch tree if the port does not have a CPU
|
|
|
|
* port local to this switch.
|
|
|
|
*/
|
|
|
|
static int dsa_tree_setup_cpu_ports(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
struct dsa_port *cpu_dp, *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(cpu_dp, &dst->ports, list) {
|
|
|
|
if (!dsa_port_is_cpu(cpu_dp))
|
|
|
|
continue;
|
|
|
|
|
2021-10-21 01:49:51 +08:00
|
|
|
/* Prefer a local CPU port */
|
|
|
|
dsa_switch_for_each_port(dp, cpu_dp->ds) {
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
/* Prefer the first local CPU port found */
|
|
|
|
if (dp->cpu_dp)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp))
|
|
|
|
dp->cpu_dp = cpu_dp;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return dsa_tree_setup_default_cpu(dst);
|
|
|
|
}
|
|
|
|
|
2021-08-04 21:54:29 +08:00
|
|
|
static void dsa_tree_teardown_cpu_ports(struct dsa_switch_tree *dst)
|
2017-11-07 05:11:44 +08:00
|
|
|
{
|
2019-10-22 04:51:24 +08:00
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp))
|
|
|
|
dp->cpu_dp = NULL;
|
2017-11-07 05:11:44 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:48 +08:00
|
|
|
static int dsa_port_setup(struct dsa_port *dp)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2019-08-20 04:00:48 +08:00
|
|
|
struct devlink_port *dlp = &dp->devlink_port;
|
2019-08-31 20:46:19 +08:00
|
|
|
bool dsa_port_link_registered = false;
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
struct dsa_switch *ds = dp->ds;
|
2019-08-31 20:46:19 +08:00
|
|
|
bool dsa_port_enabled = false;
|
|
|
|
int err = 0;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
if (dp->setup)
|
|
|
|
return 0;
|
|
|
|
|
net: dsa: introduce locking for the address lists on CPU and DSA ports
Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
no one is serializing access to the address lists that DSA keeps for the
purpose of reference counting on shared ports (CPU and cascade ports).
It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
We need to avoid that.
Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
still runs under the rtnl_mutex. But it would be nice if it would not
depend on that being the case. So let's introduce a mutex per port (the
address lists are per port too) and share it between dp->mdbs and
dp->fdbs.
The place where we put the locking is interesting. It could be tempting
to put a DSA-level lock which still serializes calls to
.port_fdb_{add,del}, but it would still not avoid concurrency with other
driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
.port_fast_age). So it would add a very false sense of security (and
adding a global switch-wide lock in DSA to resynchronize with the
rtnl_lock is also counterproductive and hard).
So the locking is intentionally done only where the dp->fdbs and dp->mdbs
lists are traversed. That means, from a driver perspective, that
.port_fdb_add will be called with the dp->addr_lists_lock mutex held on
the CPU port, but not held on user ports. This is done so that driver
writers are not encouraged to rely on any guarantee offered by
dp->addr_lists_lock.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 01:17:54 +08:00
|
|
|
mutex_init(&dp->addr_lists_lock);
|
net: dsa: add explicit support for host bridge VLANs
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
does so on user ports. This is good for basic functionality but has
several limitations:
- the VLAN group which must reach the CPU may be radically different
from the VLAN group that must be autonomously forwarded by the switch.
In other words, the admin may want to isolate noisy stations and avoid
traffic from them going to the control processor of the switch, where
it would just waste useless cycles. The bridge already supports
independent control of VLAN groups on bridge ports and on the bridge
itself, and when VLAN-aware, it will drop packets in software anyway
if their VID isn't added as a 'self' entry towards the bridge device.
- Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
on replaying the host VLANs as well. The 2 VLAN groups are
approximately the same in most regular cases, but there are corner
cases when timing matters, and DSA's approximation of replicating
VLANs on shared ports simply does not work.
- If a user makes the bridge (implicitly the CPU port) join a VLAN by
accident, there is no way for the CPU port to isolate itself from that
noisy VLAN except by rebooting the system. This is because for each
VLAN added on a user port, DSA will add it on shared ports too, but
for each VLAN deletion on a user port, it will remain installed on
shared ports, since DSA has no good indication of whether the VLAN is
still in use or not.
Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
addition and removal events, DSA has a simple and straightforward task
of separating the bridge port VLANs (these have an orig_dev which is a
DSA slave interface, or a LAG interface) from the host VLANs (these have
an orig_dev which is a bridge interface), and to keep a simple reference
count of each VID on each shared port.
Forwarding VLANs must be installed on the bridge ports and on all DSA
ports interconnecting them. We don't have a good view of the exact
topology, so we simply install forwarding VLANs on all DSA ports, which
is what has been done until now.
Host VLANs must be installed primarily on the dedicated CPU port of each
bridge port. More subtly, they must also be installed on upstream-facing
and downstream-facing DSA ports that are connecting the bridge ports and
the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
entry may be absent from VTU) is still addressed even if that switch is
in a cross-chip setup, and it has no local CPU port.
Therefore:
- user ports contain only bridge port (forwarding) VLANs, and no
refcounting is necessary
- DSA ports contain both forwarding and host VLANs. Refcounting is
necessary among these 2 types.
- CPU ports contain only host VLANs. Refcounting is also necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16 01:02:17 +08:00
|
|
|
mutex_init(&dp->vlans_lock);
|
2021-06-29 22:06:52 +08:00
|
|
|
INIT_LIST_HEAD(&dp->fdbs);
|
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 22:06:50 +08:00
|
|
|
INIT_LIST_HEAD(&dp->mdbs);
|
net: dsa: add explicit support for host bridge VLANs
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
does so on user ports. This is good for basic functionality but has
several limitations:
- the VLAN group which must reach the CPU may be radically different
from the VLAN group that must be autonomously forwarded by the switch.
In other words, the admin may want to isolate noisy stations and avoid
traffic from them going to the control processor of the switch, where
it would just waste useless cycles. The bridge already supports
independent control of VLAN groups on bridge ports and on the bridge
itself, and when VLAN-aware, it will drop packets in software anyway
if their VID isn't added as a 'self' entry towards the bridge device.
- Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
on replaying the host VLANs as well. The 2 VLAN groups are
approximately the same in most regular cases, but there are corner
cases when timing matters, and DSA's approximation of replicating
VLANs on shared ports simply does not work.
- If a user makes the bridge (implicitly the CPU port) join a VLAN by
accident, there is no way for the CPU port to isolate itself from that
noisy VLAN except by rebooting the system. This is because for each
VLAN added on a user port, DSA will add it on shared ports too, but
for each VLAN deletion on a user port, it will remain installed on
shared ports, since DSA has no good indication of whether the VLAN is
still in use or not.
Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
addition and removal events, DSA has a simple and straightforward task
of separating the bridge port VLANs (these have an orig_dev which is a
DSA slave interface, or a LAG interface) from the host VLANs (these have
an orig_dev which is a bridge interface), and to keep a simple reference
count of each VID on each shared port.
Forwarding VLANs must be installed on the bridge ports and on all DSA
ports interconnecting them. We don't have a good view of the exact
topology, so we simply install forwarding VLANs on all DSA ports, which
is what has been done until now.
Host VLANs must be installed primarily on the dedicated CPU port of each
bridge port. More subtly, they must also be installed on upstream-facing
and downstream-facing DSA ports that are connecting the bridge ports and
the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
entry may be absent from VTU) is still addressed even if that switch is
in a cross-chip setup, and it has no local CPU port.
Therefore:
- user ports contain only bridge port (forwarding) VLANs, and no
refcounting is necessary
- DSA ports contain both forwarding and host VLANs. Refcounting is
necessary among these 2 types.
- CPU ports contain only host VLANs. Refcounting is also necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16 01:02:17 +08:00
|
|
|
INIT_LIST_HEAD(&dp->vlans);
|
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 22:06:50 +08:00
|
|
|
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
if (ds->ops->port_setup) {
|
|
|
|
err = ds->ops->port_setup(ds, dp->index);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:48 +08:00
|
|
|
switch (dp->type) {
|
|
|
|
case DSA_PORT_TYPE_UNUSED:
|
2019-08-20 04:00:50 +08:00
|
|
|
dsa_port_disable(dp);
|
2017-11-07 05:11:48 +08:00
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_CPU:
|
2018-05-18 15:29:03 +08:00
|
|
|
err = dsa_port_link_register_of(dp);
|
2019-08-20 04:00:50 +08:00
|
|
|
if (err)
|
2019-08-31 20:46:19 +08:00
|
|
|
break;
|
|
|
|
dsa_port_link_registered = true;
|
2019-08-20 04:00:50 +08:00
|
|
|
|
|
|
|
err = dsa_port_enable(dp, NULL);
|
2019-05-30 14:09:07 +08:00
|
|
|
if (err)
|
2019-08-31 20:46:19 +08:00
|
|
|
break;
|
|
|
|
dsa_port_enabled = true;
|
|
|
|
|
2018-05-18 15:29:03 +08:00
|
|
|
break;
|
2017-11-07 05:11:48 +08:00
|
|
|
case DSA_PORT_TYPE_DSA:
|
2018-01-23 23:03:46 +08:00
|
|
|
err = dsa_port_link_register_of(dp);
|
2019-08-20 04:00:50 +08:00
|
|
|
if (err)
|
2019-08-31 20:46:19 +08:00
|
|
|
break;
|
|
|
|
dsa_port_link_registered = true;
|
2019-08-20 04:00:50 +08:00
|
|
|
|
|
|
|
err = dsa_port_enable(dp, NULL);
|
2019-05-30 14:09:07 +08:00
|
|
|
if (err)
|
2019-08-31 20:46:19 +08:00
|
|
|
break;
|
|
|
|
dsa_port_enabled = true;
|
|
|
|
|
2017-11-07 05:11:48 +08:00
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_USER:
|
of: net: pass the dst buffer to of_get_mac_address()
of_get_mac_address() returns a "const void*" pointer to a MAC address.
Lately, support to fetch the MAC address by an NVMEM provider was added.
But this will only work with platform devices. It will not work with
PCI devices (e.g. of an integrated root complex) and esp. not with DSA
ports.
There is an of_* variant of the nvmem binding which works without
devices. The returned data of a nvmem_cell_read() has to be freed after
use. On the other hand the return of_get_mac_address() points to some
static data without a lifetime. The trick for now, was to allocate a
device resource managed buffer which is then returned. This will only
work if we have an actual device.
Change it, so that the caller of of_get_mac_address() has to supply a
buffer where the MAC address is written to. Unfortunately, this will
touch all drivers which use the of_get_mac_address().
Usually the code looks like:
const char *addr;
addr = of_get_mac_address(np);
if (!IS_ERR(addr))
ether_addr_copy(ndev->dev_addr, addr);
This can then be simply rewritten as:
of_get_mac_address(np, ndev->dev_addr);
Sometimes is_valid_ether_addr() is used to test the MAC address.
of_get_mac_address() already makes sure, it just returns a valid MAC
address. Thus we can just test its return code. But we have to be
careful if there are still other sources for the MAC address before the
of_get_mac_address(). In this case we have to keep the
is_valid_ether_addr() call.
The following coccinelle patch was used to convert common cases to the
new style. Afterwards, I've manually gone over the drivers and fixed the
return code variable: either used a new one or if one was already
available use that. Mansour Moufid, thanks for that coccinelle patch!
<spml>
@a@
identifier x;
expression y, z;
@@
- x = of_get_mac_address(y);
+ x = of_get_mac_address(y, z);
<...
- ether_addr_copy(z, x);
...>
@@
identifier a.x;
@@
- if (<+... x ...+>) {}
@@
identifier a.x;
@@
if (<+... x ...+>) {
...
}
- else {}
@@
identifier a.x;
expression e;
@@
- if (<+... x ...+>@e)
- {}
- else
+ if (!(e))
{...}
@@
expression x, y, z;
@@
- x = of_get_mac_address(y, z);
+ of_get_mac_address(y, z);
... when != x
</spml>
All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
compile-time tested.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 01:47:17 +08:00
|
|
|
of_get_mac_address(dp->dn, dp->mac);
|
2017-11-07 05:11:48 +08:00
|
|
|
err = dsa_slave_create(dp);
|
|
|
|
if (err)
|
2019-08-31 20:46:19 +08:00
|
|
|
break;
|
2019-08-20 04:00:48 +08:00
|
|
|
|
|
|
|
devlink_port_type_eth_set(dlp, dp->slave);
|
2017-11-07 05:11:48 +08:00
|
|
|
break;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2019-08-31 20:46:19 +08:00
|
|
|
if (err && dsa_port_enabled)
|
|
|
|
dsa_port_disable(dp);
|
|
|
|
if (err && dsa_port_link_registered)
|
|
|
|
dsa_port_link_unregister_of(dp);
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
if (err) {
|
|
|
|
if (ds->ops->port_teardown)
|
|
|
|
ds->ops->port_teardown(ds, dp->index);
|
2019-10-22 04:51:19 +08:00
|
|
|
return err;
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
}
|
2019-08-31 20:46:19 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
dp->setup = true;
|
|
|
|
|
|
|
|
return 0;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2020-10-05 00:12:53 +08:00
|
|
|
static int dsa_port_devlink_setup(struct dsa_port *dp)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2019-08-20 04:00:48 +08:00
|
|
|
struct devlink_port *dlp = &dp->devlink_port;
|
2020-10-05 00:12:53 +08:00
|
|
|
struct dsa_switch_tree *dst = dp->ds->dst;
|
|
|
|
struct devlink_port_attrs attrs = {};
|
|
|
|
struct devlink *dl = dp->ds->devlink;
|
|
|
|
const unsigned char *id;
|
|
|
|
unsigned char len;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
id = (const unsigned char *)&dst->index;
|
|
|
|
len = sizeof(dst->index);
|
|
|
|
|
|
|
|
attrs.phys.port_number = dp->index;
|
|
|
|
memcpy(attrs.switch_id.id, id, len);
|
|
|
|
attrs.switch_id.id_len = len;
|
|
|
|
memset(dlp, 0, sizeof(*dlp));
|
|
|
|
|
|
|
|
switch (dp->type) {
|
|
|
|
case DSA_PORT_TYPE_UNUSED:
|
|
|
|
attrs.flavour = DEVLINK_PORT_FLAVOUR_UNUSED;
|
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_CPU:
|
|
|
|
attrs.flavour = DEVLINK_PORT_FLAVOUR_CPU;
|
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_DSA:
|
|
|
|
attrs.flavour = DEVLINK_PORT_FLAVOUR_DSA;
|
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_USER:
|
|
|
|
attrs.flavour = DEVLINK_PORT_FLAVOUR_PHYSICAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
devlink_port_attrs_set(dlp, &attrs);
|
|
|
|
err = devlink_port_register(dl, dlp, dp->index);
|
|
|
|
|
|
|
|
if (!err)
|
|
|
|
dp->devlink_port_setup = true;
|
2017-11-07 05:11:48 +08:00
|
|
|
|
2020-10-05 00:12:53 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_port_teardown(struct dsa_port *dp)
|
|
|
|
{
|
2021-01-12 08:48:31 +08:00
|
|
|
struct devlink_port *dlp = &dp->devlink_port;
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
struct dsa_switch *ds = dp->ds;
|
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 22:06:50 +08:00
|
|
|
struct dsa_mac_addr *a, *tmp;
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
struct net_device *slave;
|
2022-02-16 01:02:18 +08:00
|
|
|
struct dsa_vlan *v, *n;
|
2021-01-12 08:48:31 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
if (!dp->setup)
|
|
|
|
return;
|
|
|
|
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
if (ds->ops->port_teardown)
|
|
|
|
ds->ops->port_teardown(ds, dp->index);
|
|
|
|
|
2021-01-12 08:48:31 +08:00
|
|
|
devlink_port_type_clear(dlp);
|
|
|
|
|
2017-11-07 05:11:48 +08:00
|
|
|
switch (dp->type) {
|
|
|
|
case DSA_PORT_TYPE_UNUSED:
|
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_CPU:
|
2019-08-20 04:00:50 +08:00
|
|
|
dsa_port_disable(dp);
|
2019-08-20 04:00:48 +08:00
|
|
|
dsa_port_link_unregister_of(dp);
|
|
|
|
break;
|
2017-11-07 05:11:48 +08:00
|
|
|
case DSA_PORT_TYPE_DSA:
|
2019-08-20 04:00:50 +08:00
|
|
|
dsa_port_disable(dp);
|
2018-01-23 23:03:46 +08:00
|
|
|
dsa_port_link_unregister_of(dp);
|
2017-11-07 05:11:48 +08:00
|
|
|
break;
|
|
|
|
case DSA_PORT_TYPE_USER:
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
slave = dp->slave;
|
|
|
|
|
|
|
|
if (slave) {
|
2017-11-07 05:11:48 +08:00
|
|
|
dp->slave = NULL;
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
dsa_slave_destroy(slave);
|
2017-11-07 05:11:48 +08:00
|
|
|
}
|
|
|
|
break;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
2019-10-22 04:51:19 +08:00
|
|
|
|
2021-06-29 22:06:52 +08:00
|
|
|
list_for_each_entry_safe(a, tmp, &dp->fdbs, list) {
|
|
|
|
list_del(&a->list);
|
|
|
|
kfree(a);
|
|
|
|
}
|
|
|
|
|
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 22:06:50 +08:00
|
|
|
list_for_each_entry_safe(a, tmp, &dp->mdbs, list) {
|
|
|
|
list_del(&a->list);
|
|
|
|
kfree(a);
|
|
|
|
}
|
|
|
|
|
2022-02-16 01:02:18 +08:00
|
|
|
list_for_each_entry_safe(v, n, &dp->vlans, list) {
|
|
|
|
list_del(&v->list);
|
|
|
|
kfree(v);
|
|
|
|
}
|
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
dp->setup = false;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2020-10-05 00:12:53 +08:00
|
|
|
static void dsa_port_devlink_teardown(struct dsa_port *dp)
|
|
|
|
{
|
|
|
|
struct devlink_port *dlp = &dp->devlink_port;
|
|
|
|
|
|
|
|
if (dp->devlink_port_setup)
|
|
|
|
devlink_port_unregister(dlp);
|
|
|
|
dp->devlink_port_setup = false;
|
|
|
|
}
|
|
|
|
|
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 22:29:16 +08:00
|
|
|
/* Destroy the current devlink port, and create a new one which has the UNUSED
|
|
|
|
* flavour. At this point, any call to ds->ops->port_setup has been already
|
|
|
|
* balanced out by a call to ds->ops->port_teardown, so we know that any
|
|
|
|
* devlink port regions the driver had are now unregistered. We then call its
|
|
|
|
* ds->ops->port_setup again, in order for the driver to re-create them on the
|
|
|
|
* new devlink port.
|
|
|
|
*/
|
|
|
|
static int dsa_port_reinit_as_unused(struct dsa_port *dp)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dp->ds;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
dsa_port_devlink_teardown(dp);
|
|
|
|
dp->type = DSA_PORT_TYPE_UNUSED;
|
|
|
|
err = dsa_port_devlink_setup(dp);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
if (ds->ops->port_setup) {
|
|
|
|
/* On error, leave the devlink port registered,
|
|
|
|
* dsa_switch_teardown will clean it up later.
|
|
|
|
*/
|
|
|
|
err = ds->ops->port_setup(ds, dp->index);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-09-19 03:11:08 +08:00
|
|
|
static int dsa_devlink_info_get(struct devlink *dl,
|
|
|
|
struct devlink_info_req *req,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_to_ds(dl);
|
|
|
|
|
|
|
|
if (ds->ops->devlink_info_get)
|
|
|
|
return ds->ops->devlink_info_get(ds, req, extack);
|
|
|
|
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
2021-01-15 10:11:13 +08:00
|
|
|
static int dsa_devlink_sb_pool_get(struct devlink *dl,
|
|
|
|
unsigned int sb_index, u16 pool_index,
|
|
|
|
struct devlink_sb_pool_info *pool_info)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_to_ds(dl);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_pool_get)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_pool_get(ds, sb_index, pool_index,
|
|
|
|
pool_info);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_pool_set(struct devlink *dl, unsigned int sb_index,
|
|
|
|
u16 pool_index, u32 size,
|
|
|
|
enum devlink_sb_threshold_type threshold_type,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_to_ds(dl);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_pool_set)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_pool_set(ds, sb_index, pool_index, size,
|
|
|
|
threshold_type, extack);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_port_pool_get(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index, u16 pool_index,
|
|
|
|
u32 *p_threshold)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_port_pool_get)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_port_pool_get(ds, port, sb_index,
|
|
|
|
pool_index, p_threshold);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_port_pool_set(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index, u16 pool_index,
|
|
|
|
u32 threshold,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_port_pool_set)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_port_pool_set(ds, port, sb_index,
|
|
|
|
pool_index, threshold, extack);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
dsa_devlink_sb_tc_pool_bind_get(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index, u16 tc_index,
|
|
|
|
enum devlink_sb_pool_type pool_type,
|
|
|
|
u16 *p_pool_index, u32 *p_threshold)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_tc_pool_bind_get)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_tc_pool_bind_get(ds, port, sb_index,
|
|
|
|
tc_index, pool_type,
|
|
|
|
p_pool_index, p_threshold);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
dsa_devlink_sb_tc_pool_bind_set(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index, u16 tc_index,
|
|
|
|
enum devlink_sb_pool_type pool_type,
|
|
|
|
u16 pool_index, u32 threshold,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_tc_pool_bind_set)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_tc_pool_bind_set(ds, port, sb_index,
|
|
|
|
tc_index, pool_type,
|
|
|
|
pool_index, threshold,
|
|
|
|
extack);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_occ_snapshot(struct devlink *dl,
|
|
|
|
unsigned int sb_index)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_to_ds(dl);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_occ_snapshot)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_occ_snapshot(ds, sb_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_occ_max_clear(struct devlink *dl,
|
|
|
|
unsigned int sb_index)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_to_ds(dl);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_occ_max_clear)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_occ_max_clear(ds, sb_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_devlink_sb_occ_port_pool_get(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index,
|
|
|
|
u16 pool_index, u32 *p_cur,
|
|
|
|
u32 *p_max)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_occ_port_pool_get)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_occ_port_pool_get(ds, port, sb_index,
|
|
|
|
pool_index, p_cur, p_max);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
dsa_devlink_sb_occ_tc_port_bind_get(struct devlink_port *dlp,
|
|
|
|
unsigned int sb_index, u16 tc_index,
|
|
|
|
enum devlink_sb_pool_type pool_type,
|
|
|
|
u32 *p_cur, u32 *p_max)
|
|
|
|
{
|
|
|
|
struct dsa_switch *ds = dsa_devlink_port_to_ds(dlp);
|
|
|
|
int port = dsa_devlink_port_to_port(dlp);
|
|
|
|
|
|
|
|
if (!ds->ops->devlink_sb_occ_tc_port_bind_get)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ds->ops->devlink_sb_occ_tc_port_bind_get(ds, port,
|
|
|
|
sb_index, tc_index,
|
|
|
|
pool_type, p_cur,
|
|
|
|
p_max);
|
|
|
|
}
|
|
|
|
|
2020-09-19 03:11:08 +08:00
|
|
|
static const struct devlink_ops dsa_devlink_ops = {
|
2021-01-15 10:11:13 +08:00
|
|
|
.info_get = dsa_devlink_info_get,
|
|
|
|
.sb_pool_get = dsa_devlink_sb_pool_get,
|
|
|
|
.sb_pool_set = dsa_devlink_sb_pool_set,
|
|
|
|
.sb_port_pool_get = dsa_devlink_sb_port_pool_get,
|
|
|
|
.sb_port_pool_set = dsa_devlink_sb_port_pool_set,
|
|
|
|
.sb_tc_pool_bind_get = dsa_devlink_sb_tc_pool_bind_get,
|
|
|
|
.sb_tc_pool_bind_set = dsa_devlink_sb_tc_pool_bind_set,
|
|
|
|
.sb_occ_snapshot = dsa_devlink_sb_occ_snapshot,
|
|
|
|
.sb_occ_max_clear = dsa_devlink_sb_occ_max_clear,
|
|
|
|
.sb_occ_port_pool_get = dsa_devlink_sb_occ_port_pool_get,
|
|
|
|
.sb_occ_tc_port_bind_get = dsa_devlink_sb_occ_tc_port_bind_get,
|
2020-09-19 03:11:08 +08:00
|
|
|
};
|
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
static int dsa_switch_setup_tag_protocol(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
const struct dsa_device_ops *tag_ops = ds->dst->tag_ops;
|
|
|
|
struct dsa_switch_tree *dst = ds->dst;
|
2021-10-21 01:49:50 +08:00
|
|
|
struct dsa_port *cpu_dp;
|
|
|
|
int err;
|
2021-04-21 02:53:10 +08:00
|
|
|
|
|
|
|
if (tag_ops->proto == dst->default_proto)
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
goto connect;
|
2021-04-21 02:53:10 +08:00
|
|
|
|
2021-10-21 01:49:50 +08:00
|
|
|
dsa_switch_for_each_cpu_port(cpu_dp, ds) {
|
2021-10-09 20:26:07 +08:00
|
|
|
rtnl_lock();
|
2021-10-21 01:49:50 +08:00
|
|
|
err = ds->ops->change_tag_protocol(ds, cpu_dp->index,
|
|
|
|
tag_ops->proto);
|
2021-10-09 20:26:07 +08:00
|
|
|
rtnl_unlock();
|
2021-04-21 02:53:10 +08:00
|
|
|
if (err) {
|
|
|
|
dev_err(ds->dev, "Unable to use tag protocol \"%s\": %pe\n",
|
|
|
|
tag_ops->name, ERR_PTR(err));
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
connect:
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
if (tag_ops->connect) {
|
|
|
|
err = tag_ops->connect(ds);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
if (ds->ops->connect_tag_protocol) {
|
|
|
|
err = ds->ops->connect_tag_protocol(ds, tag_ops->proto);
|
|
|
|
if (err) {
|
|
|
|
dev_err(ds->dev,
|
|
|
|
"Unable to connect to tag protocol \"%s\": %pe\n",
|
|
|
|
tag_ops->name, ERR_PTR(err));
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
goto disconnect;
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
return 0;
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
|
|
|
|
disconnect:
|
|
|
|
if (tag_ops->disconnect)
|
|
|
|
tag_ops->disconnect(ds);
|
|
|
|
|
|
|
|
return err;
|
2021-04-21 02:53:10 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:47 +08:00
|
|
|
static int dsa_switch_setup(struct dsa_switch *ds)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2019-10-25 07:03:51 +08:00
|
|
|
struct dsa_devlink_priv *dl_priv;
|
2020-10-05 00:12:53 +08:00
|
|
|
struct dsa_port *dp;
|
2019-10-22 04:51:19 +08:00
|
|
|
int err;
|
|
|
|
|
|
|
|
if (ds->setup)
|
|
|
|
return 0;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2016-06-08 07:32:39 +08:00
|
|
|
/* Initialize ds->phys_mii_mask before registering the slave MDIO bus
|
2016-08-24 00:38:56 +08:00
|
|
|
* driver and before ops->setup() has run, since the switch drivers and
|
2016-06-08 07:32:39 +08:00
|
|
|
* the slave MDIO bus driver rely on these values for probing PHY
|
|
|
|
* devices or not
|
|
|
|
*/
|
2017-10-26 23:22:56 +08:00
|
|
|
ds->phys_mii_mask |= dsa_user_ports(ds);
|
2016-06-08 07:32:39 +08:00
|
|
|
|
2017-03-29 05:45:07 +08:00
|
|
|
/* Add the switch to devlink before calling setup, so that setup can
|
|
|
|
* add dpipe tables
|
|
|
|
*/
|
2021-08-09 02:57:43 +08:00
|
|
|
ds->devlink =
|
|
|
|
devlink_alloc(&dsa_devlink_ops, sizeof(*dl_priv), ds->dev);
|
2017-03-29 05:45:07 +08:00
|
|
|
if (!ds->devlink)
|
|
|
|
return -ENOMEM;
|
2019-10-25 07:03:51 +08:00
|
|
|
dl_priv = devlink_priv(ds->devlink);
|
|
|
|
dl_priv->ds = ds;
|
2017-03-29 05:45:07 +08:00
|
|
|
|
2020-10-05 00:12:53 +08:00
|
|
|
/* Setup devlink port instances now, so that the switch
|
|
|
|
* setup() can register regions etc, against the ports
|
|
|
|
*/
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_port(dp, ds) {
|
|
|
|
err = dsa_port_devlink_setup(dp);
|
|
|
|
if (err)
|
|
|
|
goto unregister_devlink_ports;
|
2020-10-05 00:12:53 +08:00
|
|
|
}
|
|
|
|
|
2017-02-04 02:20:20 +08:00
|
|
|
err = dsa_switch_register_notifier(ds);
|
|
|
|
if (err)
|
2020-10-05 00:12:53 +08:00
|
|
|
goto unregister_devlink_ports;
|
2017-02-04 02:20:20 +08:00
|
|
|
|
net: dsa: set configure_vlan_while_not_filtering to true by default
As explained in commit 54a0ed0df496 ("net: dsa: provide an option for
drivers to always receive bridge VLANs"), DSA has historically been
skipping VLAN switchdev operations when the bridge wasn't in
vlan_filtering mode, but the reason why it was doing that has never been
clear. So the configure_vlan_while_not_filtering option is there merely
to preserve functionality for existing drivers. It isn't some behavior
that drivers should opt into. Ideally, when all drivers leave this flag
set, we can delete the dsa_port_skip_vlan_configuration() function.
New drivers always seem to omit setting this flag, for some reason. So
let's reverse the logic: the DSA core sets it by default to true before
the .setup() callback, and legacy drivers can turn it off. This way, new
drivers get the new behavior by default, unless they explicitly set the
flag to false, which is more obvious during review.
Remove the assignment from drivers which were setting it to true, and
add the assignment to false for the drivers that didn't previously have
it. This way, it should be easier to see how many we have left.
The following drivers: lan9303, mv88e6060 were skipped from setting this
flag to false, because they didn't have any VLAN offload ops in the
first place.
The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and
therefore also inherits the configure_vlan_while_not_filtering=true
behavior.
Also, print a message through netlink extack every time a VLAN has been
skipped. This is mildly annoying on purpose, so that (a) it is at least
clear that VLANs are being skipped - the legacy behavior in itself is
confusing, and the extack should be much more difficult to miss, unlike
kernel logs - and (b) people have one more incentive to convert to the
new behavior.
No behavior change except for the added prints is intended at this time.
$ ip link add br0 type bridge vlan_filtering 0
$ ip link set sw0p2 master br0
[ 60.315148] br0: port 1(sw0p2) entered blocking state
[ 60.320350] br0: port 1(sw0p2) entered disabled state
[ 60.327839] device sw0p2 entered promiscuous mode
[ 60.334905] br0: port 1(sw0p2) entered blocking state
[ 60.340142] br0: port 1(sw0p2) entered forwarding state
Warning: dsa_core: skipping configuration of VLAN. # This was the pvid
$ bridge vlan add dev sw0p2 vid 100
Warning: dsa_core: skipping configuration of VLAN.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 07:19:19 +08:00
|
|
|
ds->configure_vlan_while_not_filtering = true;
|
|
|
|
|
2019-05-05 18:19:20 +08:00
|
|
|
err = ds->ops->setup(ds);
|
|
|
|
if (err < 0)
|
2019-05-30 14:09:07 +08:00
|
|
|
goto unregister_notifier;
|
2019-05-05 18:19:20 +08:00
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
err = dsa_switch_setup_tag_protocol(ds);
|
|
|
|
if (err)
|
|
|
|
goto teardown;
|
|
|
|
|
2016-08-24 00:38:56 +08:00
|
|
|
if (!ds->slave_mii_bus && ds->ops->phy_read) {
|
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()")
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 05:42:08 +08:00
|
|
|
ds->slave_mii_bus = mdiobus_alloc();
|
2019-05-30 14:09:07 +08:00
|
|
|
if (!ds->slave_mii_bus) {
|
|
|
|
err = -ENOMEM;
|
2021-02-05 00:33:51 +08:00
|
|
|
goto teardown;
|
2019-05-30 14:09:07 +08:00
|
|
|
}
|
2016-06-08 07:32:40 +08:00
|
|
|
|
|
|
|
dsa_slave_mii_bus_init(ds);
|
|
|
|
|
|
|
|
err = mdiobus_register(ds->slave_mii_bus);
|
|
|
|
if (err < 0)
|
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()")
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 05:42:08 +08:00
|
|
|
goto free_slave_mii_bus;
|
2016-06-08 07:32:40 +08:00
|
|
|
}
|
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
ds->setup = true;
|
2021-09-25 19:23:01 +08:00
|
|
|
devlink_register(ds->devlink);
|
2016-06-05 03:17:07 +08:00
|
|
|
return 0;
|
2019-05-30 14:09:07 +08:00
|
|
|
|
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()")
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 05:42:08 +08:00
|
|
|
free_slave_mii_bus:
|
|
|
|
if (ds->slave_mii_bus && ds->ops->phy_read)
|
|
|
|
mdiobus_free(ds->slave_mii_bus);
|
2021-02-05 00:33:51 +08:00
|
|
|
teardown:
|
|
|
|
if (ds->ops->teardown)
|
|
|
|
ds->ops->teardown(ds);
|
2019-05-30 14:09:07 +08:00
|
|
|
unregister_notifier:
|
|
|
|
dsa_switch_unregister_notifier(ds);
|
2020-10-05 00:12:53 +08:00
|
|
|
unregister_devlink_ports:
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_port(dp, ds)
|
|
|
|
dsa_port_devlink_teardown(dp);
|
2019-05-30 14:09:07 +08:00
|
|
|
devlink_free(ds->devlink);
|
|
|
|
ds->devlink = NULL;
|
|
|
|
return err;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:47 +08:00
|
|
|
static void dsa_switch_teardown(struct dsa_switch *ds)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2020-10-05 00:12:53 +08:00
|
|
|
struct dsa_port *dp;
|
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
if (!ds->setup)
|
|
|
|
return;
|
|
|
|
|
2021-09-25 19:23:01 +08:00
|
|
|
if (ds->devlink)
|
|
|
|
devlink_unregister(ds->devlink);
|
|
|
|
|
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()")
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 05:42:08 +08:00
|
|
|
if (ds->slave_mii_bus && ds->ops->phy_read) {
|
2016-06-08 07:32:40 +08:00
|
|
|
mdiobus_unregister(ds->slave_mii_bus);
|
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()")
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 05:42:08 +08:00
|
|
|
mdiobus_free(ds->slave_mii_bus);
|
|
|
|
ds->slave_mii_bus = NULL;
|
|
|
|
}
|
2017-02-04 02:20:20 +08:00
|
|
|
|
2019-06-08 20:04:28 +08:00
|
|
|
if (ds->ops->teardown)
|
|
|
|
ds->ops->teardown(ds);
|
|
|
|
|
2021-10-12 20:37:35 +08:00
|
|
|
dsa_switch_unregister_notifier(ds);
|
|
|
|
|
2017-03-29 05:45:07 +08:00
|
|
|
if (ds->devlink) {
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_port(dp, ds)
|
|
|
|
dsa_port_devlink_teardown(dp);
|
2017-03-29 05:45:07 +08:00
|
|
|
devlink_free(ds->devlink);
|
|
|
|
ds->devlink = NULL;
|
|
|
|
}
|
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
ds->setup = false;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-14 21:47:26 +08:00
|
|
|
/* First tear down the non-shared, then the shared ports. This ensures that
|
|
|
|
* all work items scheduled by our switchdev handlers for user ports have
|
|
|
|
* completed before we destroy the refcounting kept on the shared ports.
|
|
|
|
*/
|
|
|
|
static void dsa_tree_teardown_ports(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dsa_port_is_user(dp) || dsa_port_is_unused(dp))
|
|
|
|
dsa_port_teardown(dp);
|
|
|
|
|
|
|
|
dsa_flush_workqueue();
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
if (dsa_port_is_dsa(dp) || dsa_port_is_cpu(dp))
|
|
|
|
dsa_port_teardown(dp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_tree_teardown_switches(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list)
|
|
|
|
dsa_switch_teardown(dp->ds);
|
|
|
|
}
|
|
|
|
|
2022-01-06 07:11:16 +08:00
|
|
|
/* Bring shared ports up first, then non-shared ports */
|
|
|
|
static int dsa_tree_setup_ports(struct dsa_switch_tree *dst)
|
2017-11-07 05:11:47 +08:00
|
|
|
{
|
2017-11-07 05:11:48 +08:00
|
|
|
struct dsa_port *dp;
|
2022-01-06 07:11:16 +08:00
|
|
|
int err = 0;
|
2017-11-07 05:11:47 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
2022-01-06 07:11:16 +08:00
|
|
|
if (dsa_port_is_dsa(dp) || dsa_port_is_cpu(dp)) {
|
|
|
|
err = dsa_port_setup(dp);
|
|
|
|
if (err)
|
|
|
|
goto teardown;
|
|
|
|
}
|
2019-10-22 04:51:19 +08:00
|
|
|
}
|
2017-11-07 05:11:48 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
2022-01-06 07:11:16 +08:00
|
|
|
if (dsa_port_is_user(dp) || dsa_port_is_unused(dp)) {
|
|
|
|
err = dsa_port_setup(dp);
|
|
|
|
if (err) {
|
|
|
|
err = dsa_port_reinit_as_unused(dp);
|
|
|
|
if (err)
|
|
|
|
goto teardown;
|
|
|
|
}
|
2021-03-29 23:30:16 +08:00
|
|
|
}
|
2017-11-07 05:11:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2019-05-30 14:09:07 +08:00
|
|
|
|
2019-10-22 04:51:19 +08:00
|
|
|
teardown:
|
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-14 21:47:26 +08:00
|
|
|
dsa_tree_teardown_ports(dst);
|
2019-05-30 14:09:07 +08:00
|
|
|
|
2022-01-06 07:11:16 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_tree_setup_switches(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
err = dsa_switch_setup(dp->ds);
|
|
|
|
if (err) {
|
|
|
|
dsa_tree_teardown_switches(dst);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2019-05-30 14:09:07 +08:00
|
|
|
|
|
|
|
return err;
|
2017-11-07 05:11:47 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:45 +08:00
|
|
|
static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
|
|
|
|
{
|
2019-10-22 04:51:22 +08:00
|
|
|
struct dsa_port *dp;
|
|
|
|
int err;
|
2017-11-07 05:11:45 +08:00
|
|
|
|
2022-01-06 07:11:15 +08:00
|
|
|
rtnl_lock();
|
|
|
|
|
2019-10-22 04:51:22 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
if (dsa_port_is_cpu(dp)) {
|
net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
In order for switch driver to be able to make simple and reliable use of
the master tracking operations, they must also be notified of the
initial state of the DSA master, not just of the changes. This is
because they might enable certain features only during the time when
they know that the DSA master is up and running.
Therefore, this change explicitly checks the state of the DSA master
under the same rtnl_mutex as we were holding during the
dsa_master_setup() and dsa_master_teardown() call. The idea being that
if the DSA master became operational in between the moment in which it
became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
when we checked for the master being up, there is a chance that we
would emit a ->master_state_change() call with no actual state change.
We need to avoid that by serializing the concurrent netdevice event with
us. If the netdevice event started before, we force it to finish before
we begin, because we take rtnl_lock before making netdev_uses_dsa()
return true. So we also handle that early event and do nothing on it.
Similarly, if the dev_open() attempt is concurrent with us, it will
attempt to take the rtnl_mutex, but we're holding it. We'll see that
the master flag IFF_UP isn't set, then when we release the rtnl_mutex
we'll process the NETDEV_UP notifier.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02 08:03:21 +08:00
|
|
|
struct net_device *master = dp->master;
|
|
|
|
bool admin_up = (master->flags & IFF_UP) &&
|
|
|
|
!qdisc_tx_is_noop(master);
|
|
|
|
|
|
|
|
err = dsa_master_setup(master, dp);
|
2019-10-22 04:51:22 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
In order for switch driver to be able to make simple and reliable use of
the master tracking operations, they must also be notified of the
initial state of the DSA master, not just of the changes. This is
because they might enable certain features only during the time when
they know that the DSA master is up and running.
Therefore, this change explicitly checks the state of the DSA master
under the same rtnl_mutex as we were holding during the
dsa_master_setup() and dsa_master_teardown() call. The idea being that
if the DSA master became operational in between the moment in which it
became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
when we checked for the master being up, there is a chance that we
would emit a ->master_state_change() call with no actual state change.
We need to avoid that by serializing the concurrent netdevice event with
us. If the netdevice event started before, we force it to finish before
we begin, because we take rtnl_lock before making netdev_uses_dsa()
return true. So we also handle that early event and do nothing on it.
Similarly, if the dev_open() attempt is concurrent with us, it will
attempt to take the rtnl_mutex, but we're holding it. We'll see that
the master flag IFF_UP isn't set, then when we release the rtnl_mutex
we'll process the NETDEV_UP notifier.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02 08:03:21 +08:00
|
|
|
|
|
|
|
/* Replay master state event */
|
|
|
|
dsa_tree_master_admin_state_change(dst, master, admin_up);
|
|
|
|
dsa_tree_master_oper_state_change(dst, master,
|
|
|
|
netif_oper_up(master));
|
2019-10-22 04:51:22 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-01-06 07:11:15 +08:00
|
|
|
rtnl_unlock();
|
|
|
|
|
2019-10-22 04:51:22 +08:00
|
|
|
return 0;
|
2017-11-07 05:11:45 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
|
|
|
|
{
|
2019-10-22 04:51:22 +08:00
|
|
|
struct dsa_port *dp;
|
2017-11-07 05:11:45 +08:00
|
|
|
|
2022-01-06 07:11:15 +08:00
|
|
|
rtnl_lock();
|
|
|
|
|
net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
In order for switch driver to be able to make simple and reliable use of
the master tracking operations, they must also be notified of the
initial state of the DSA master, not just of the changes. This is
because they might enable certain features only during the time when
they know that the DSA master is up and running.
Therefore, this change explicitly checks the state of the DSA master
under the same rtnl_mutex as we were holding during the
dsa_master_setup() and dsa_master_teardown() call. The idea being that
if the DSA master became operational in between the moment in which it
became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
when we checked for the master being up, there is a chance that we
would emit a ->master_state_change() call with no actual state change.
We need to avoid that by serializing the concurrent netdevice event with
us. If the netdevice event started before, we force it to finish before
we begin, because we take rtnl_lock before making netdev_uses_dsa()
return true. So we also handle that early event and do nothing on it.
Similarly, if the dev_open() attempt is concurrent with us, it will
attempt to take the rtnl_mutex, but we're holding it. We'll see that
the master flag IFF_UP isn't set, then when we release the rtnl_mutex
we'll process the NETDEV_UP notifier.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02 08:03:21 +08:00
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
if (dsa_port_is_cpu(dp)) {
|
|
|
|
struct net_device *master = dp->master;
|
|
|
|
|
|
|
|
/* Synthesizing an "admin down" state is sufficient for
|
|
|
|
* the switches to get a notification if the master is
|
|
|
|
* currently up and running.
|
|
|
|
*/
|
|
|
|
dsa_tree_master_admin_state_change(dst, master, false);
|
|
|
|
|
|
|
|
dsa_master_teardown(master);
|
|
|
|
}
|
|
|
|
}
|
2022-01-06 07:11:15 +08:00
|
|
|
|
|
|
|
rtnl_unlock();
|
2017-11-07 05:11:45 +08:00
|
|
|
}
|
|
|
|
|
2021-01-13 16:42:53 +08:00
|
|
|
static int dsa_tree_setup_lags(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
unsigned int len = 0;
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
|
|
|
if (dp->ds->num_lag_ids > len)
|
|
|
|
len = dp->ds->num_lag_ids;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!len)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
dst->lags = kcalloc(len, sizeof(*dst->lags), GFP_KERNEL);
|
|
|
|
if (!dst->lags)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
dst->lags_len = len;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dsa_tree_teardown_lags(struct dsa_switch_tree *dst)
|
|
|
|
{
|
|
|
|
kfree(dst->lags);
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
static int dsa_tree_setup(struct dsa_switch_tree *dst)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2017-11-07 05:11:51 +08:00
|
|
|
bool complete;
|
2016-06-05 03:17:07 +08:00
|
|
|
int err;
|
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
if (dst->setup) {
|
|
|
|
pr_err("DSA: tree %d already setup! Disjoint trees?\n",
|
|
|
|
dst->index);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:51 +08:00
|
|
|
complete = dsa_tree_setup_routing_table(dst);
|
|
|
|
if (!complete)
|
|
|
|
return 0;
|
|
|
|
|
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as
follows:
eth0 eth1
| |
CPU port CPU port
| DSA link |
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0
| | | | | |
user user user user user user
port port port port port port
basically one where each switch has its own CPU port for termination,
but there is also a DSA link in case packets need to be forwarded in
hardware between one switch and another.
DSA insists to see this as a daisy chain topology, basically registering
all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding
eth1 as a valid DSA master.
This is only half the story, since when asked using dsa_port_is_cpu(),
DSA will respond that sw1p1 is a CPU port, however one which has no
dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used.
Furthermore, be there a driver for switches which support only one
upstream port. This driver iterates through its ports and checks using
dsa_is_upstream_port() whether the current port is an upstream one.
For switch 1, two ports pass the "is upstream port" checks:
- sw1p4 is an upstream port because it is a routing port towards the
dedicated CPU port assigned using dsa_tree_setup_default_cpu()
- sw1p1 is also an upstream port because it is a CPU port, albeit one
that is disabled. This is because dsa_upstream_port() returns:
if (!cpu_dp)
return port;
which means that if @dp does not have a ->cpu_dp pointer (which is a
characteristic of CPU ports themselves as well as unused ports), then
@dp is its own upstream port.
So the driver for switch 1 rightfully says: I have two upstream ports,
but I don't support multiple upstream ports! So let me error out, I
don't know which one to choose and what to do with the other one.
Generally I am against enforcing any default policy in the kernel in
terms of user to CPU port assignment (like round robin or such) but this
case is different. To solve the conundrum, one would have to:
- Disable sw1p1 in the device tree or mark it as "not a CPU port" in
order to comply with DSA's view of this topology as a daisy chain,
where the termination traffic from switch 1 must pass through switch 0.
This is counter-productive because it wastes 1Gbps of termination
throughput in switch 1.
- Disable the DSA link between sw0p4 and sw1p4 and do software
forwarding between switch 0 and 1, and basically treat the switches as
part of disjoint switch trees. This is counter-productive because it
wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1.
- Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could
work, but it makes cross-chip bridging impossible. In this setup we
would need to have 2 separate bridges, br0 spanning the ports of
switch 0, and br1 spanning the ports of switch 1, and the "DSA links
treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are
the gateway ports between one bridge and another. This is hard to
manage from a user's perspective, who wants to have a unified view of
the switching fabric and the ability to transparently add ports to the
same bridge. VLANs would also need to be explicitly managed by the
user on these gateway ports.
So it seems that the only reasonable thing to do is to make DSA prefer
CPU ports that are local to the switch. Meaning that by default, the
user and DSA ports of switch 0 will get assigned to the CPU port from
switch 0 (sw0p1) and the user and DSA ports of switch 1 will get
assigned to the CPU port from switch 1.
The way this solves the problem is that sw1p4 is no longer an upstream
port as far as switch 1 is concerned (it no longer views sw0p1 as its
dedicated CPU port).
So here we are, the first multi-CPU port that DSA supports is also
perhaps the most uneventful one: the individual switches don't support
multiple CPUs, however the DSA switch tree as a whole does have multiple
CPU ports. No user space assignment of user ports to CPU ports is
desirable, necessary, or possible.
Ports that do not have a local CPU port (say there was an extra switch
hanging off of sw0p0) default to the standard implementation of getting
assigned to the first CPU port of the DSA switch tree. Is that good
enough? Probably not (if the downstream switch was hanging off of switch
1, we would most certainly prefer its CPU port to be sw1p1), but in
order to support that use case too, we would need to traverse the
dst->rtable in search of an optimum dedicated CPU port, one that has the
smallest number of hops between dp->ds and dp->cpu_dp->ds. At the
moment, the DSA routing table structure does not keep the number of hops
between dl->dp and dl->link_dp, and while it is probably deducible,
there is zero justification to write that code now. Let's hope DSA will
never have to support that use case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 21:54:30 +08:00
|
|
|
err = dsa_tree_setup_cpu_ports(dst);
|
2017-11-07 05:11:44 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-11-07 05:11:47 +08:00
|
|
|
err = dsa_tree_setup_switches(dst);
|
|
|
|
if (err)
|
2021-08-04 21:54:29 +08:00
|
|
|
goto teardown_cpu_ports;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
err = dsa_tree_setup_master(dst);
|
2017-09-19 23:56:59 +08:00
|
|
|
if (err)
|
2019-05-30 14:09:07 +08:00
|
|
|
goto teardown_switches;
|
2017-09-19 23:56:59 +08:00
|
|
|
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
err = dsa_tree_setup_ports(dst);
|
2022-01-06 07:11:16 +08:00
|
|
|
if (err)
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
goto teardown_master;
|
2022-01-06 07:11:16 +08:00
|
|
|
|
2021-01-13 16:42:53 +08:00
|
|
|
err = dsa_tree_setup_lags(dst);
|
|
|
|
if (err)
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
goto teardown_ports;
|
2021-01-13 16:42:53 +08:00
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
dst->setup = true;
|
|
|
|
|
|
|
|
pr_info("DSA: tree %d setup\n", dst->index);
|
2016-06-05 03:17:07 +08:00
|
|
|
|
|
|
|
return 0;
|
2019-05-30 14:09:07 +08:00
|
|
|
|
2022-01-06 07:11:16 +08:00
|
|
|
teardown_ports:
|
2021-09-21 06:49:18 +08:00
|
|
|
dsa_tree_teardown_ports(dst);
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
teardown_master:
|
|
|
|
dsa_tree_teardown_master(dst);
|
2022-01-06 07:11:16 +08:00
|
|
|
teardown_switches:
|
2019-05-30 14:09:07 +08:00
|
|
|
dsa_tree_teardown_switches(dst);
|
2021-08-04 21:54:29 +08:00
|
|
|
teardown_cpu_ports:
|
|
|
|
dsa_tree_teardown_cpu_ports(dst);
|
2019-05-30 14:09:07 +08:00
|
|
|
|
|
|
|
return err;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
static void dsa_tree_teardown(struct dsa_switch_tree *dst)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2019-10-31 10:09:13 +08:00
|
|
|
struct dsa_link *dl, *next;
|
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
if (!dst->setup)
|
2016-06-05 03:17:07 +08:00
|
|
|
return;
|
|
|
|
|
2021-01-13 16:42:53 +08:00
|
|
|
dsa_tree_teardown_lags(dst);
|
|
|
|
|
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-14 21:47:26 +08:00
|
|
|
dsa_tree_teardown_ports(dst);
|
|
|
|
|
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06 07:11:17 +08:00
|
|
|
dsa_tree_teardown_master(dst);
|
|
|
|
|
2017-11-07 05:11:47 +08:00
|
|
|
dsa_tree_teardown_switches(dst);
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2021-08-04 21:54:29 +08:00
|
|
|
dsa_tree_teardown_cpu_ports(dst);
|
2016-06-08 07:32:42 +08:00
|
|
|
|
2019-10-31 10:09:13 +08:00
|
|
|
list_for_each_entry_safe(dl, next, &dst->rtable, list) {
|
|
|
|
list_del(&dl->list);
|
|
|
|
kfree(dl);
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:46 +08:00
|
|
|
pr_info("DSA: tree %d torn down\n", dst->index);
|
|
|
|
|
|
|
|
dst->setup = false;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
static int dsa_tree_bind_tag_proto(struct dsa_switch_tree *dst,
|
|
|
|
const struct dsa_device_ops *tag_ops)
|
|
|
|
{
|
|
|
|
const struct dsa_device_ops *old_tag_ops = dst->tag_ops;
|
|
|
|
struct dsa_notifier_tag_proto_info info;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
dst->tag_ops = tag_ops;
|
|
|
|
|
|
|
|
/* Notify the switches from this tree about the connection
|
|
|
|
* to the new tagger
|
|
|
|
*/
|
|
|
|
info.tag_ops = tag_ops;
|
|
|
|
err = dsa_tree_notify(dst, DSA_NOTIFIER_TAG_PROTO_CONNECT, &info);
|
|
|
|
if (err && err != -EOPNOTSUPP)
|
|
|
|
goto out_disconnect;
|
|
|
|
|
|
|
|
/* Notify the old tagger about the disconnection from this tree */
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
info.tag_ops = old_tag_ops;
|
|
|
|
dsa_tree_notify(dst, DSA_NOTIFIER_TAG_PROTO_DISCONNECT, &info);
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_disconnect:
|
net: dsa: make tagging protocols connect to individual switches from a tree
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14 09:45:36 +08:00
|
|
|
info.tag_ops = tag_ops;
|
|
|
|
dsa_tree_notify(dst, DSA_NOTIFIER_TAG_PROTO_DISCONNECT, &info);
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
dst->tag_ops = old_tag_ops;
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
/* Since the dsa/tagging sysfs device attribute is per master, the assumption
|
|
|
|
* is that all DSA switches within a tree share the same tagger, otherwise
|
|
|
|
* they would have formed disjoint trees (different "dsa,member" values).
|
|
|
|
*/
|
|
|
|
int dsa_tree_change_tag_proto(struct dsa_switch_tree *dst,
|
|
|
|
struct net_device *master,
|
|
|
|
const struct dsa_device_ops *tag_ops,
|
|
|
|
const struct dsa_device_ops *old_tag_ops)
|
|
|
|
{
|
|
|
|
struct dsa_notifier_tag_proto_info info;
|
|
|
|
struct dsa_port *dp;
|
|
|
|
int err = -EBUSY;
|
|
|
|
|
|
|
|
if (!rtnl_trylock())
|
|
|
|
return restart_syscall();
|
|
|
|
|
|
|
|
/* At the moment we don't allow changing the tag protocol under
|
|
|
|
* traffic. The rtnl_mutex also happens to serialize concurrent
|
|
|
|
* attempts to change the tagging protocol. If we ever lift the IFF_UP
|
|
|
|
* restriction, there needs to be another mutex which serializes this.
|
|
|
|
*/
|
|
|
|
if (master->flags & IFF_UP)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
list_for_each_entry(dp, &dst->ports, list) {
|
2021-10-21 01:49:50 +08:00
|
|
|
if (!dsa_port_is_user(dp))
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
if (dp->slave->flags & IFF_UP)
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
/* Notify the tag protocol change */
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
info.tag_ops = tag_ops;
|
|
|
|
err = dsa_tree_notify(dst, DSA_NOTIFIER_TAG_PROTO, &info);
|
|
|
|
if (err)
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
return err;
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
|
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 07:34:37 +08:00
|
|
|
err = dsa_tree_bind_tag_proto(dst, tag_ops);
|
|
|
|
if (err)
|
|
|
|
goto out_unwind_tagger;
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
|
|
|
|
rtnl_unlock();
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unwind_tagger:
|
|
|
|
info.tag_ops = old_tag_ops;
|
|
|
|
dsa_tree_notify(dst, DSA_NOTIFIER_TAG_PROTO, &info);
|
|
|
|
out_unlock:
|
|
|
|
rtnl_unlock();
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net: dsa: provide switch operations for tracking the master state
Certain drivers may need to send management traffic to the switch for
things like register access, FDB dump, etc, to accelerate what their
slow bus (SPI, I2C, MDIO) can already do.
Ethernet is faster (especially in bulk transactions) but is also more
unreliable, since the user may decide to bring the DSA master down (or
not bring it up), therefore severing the link between the host and the
attached switch.
Drivers needing Ethernet-based register access already should have
fallback logic to the slow bus if the Ethernet method fails, but that
fallback may be based on a timeout, and the I/O to the switch may slow
down to a halt if the master is down, because every Ethernet packet will
have to time out. The driver also doesn't have the option to turn off
Ethernet-based I/O momentarily, because it wouldn't know when to turn it
back on.
Which is where this change comes in. By tracking NETDEV_CHANGE,
NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
the exact interval of time during which this interface is reliably
available for traffic. Provide this information to switches so they can
use it as they wish.
An helper is added dsa_port_master_is_operational() to check if a master
port is operational.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02 08:03:20 +08:00
|
|
|
static void dsa_tree_master_state_change(struct dsa_switch_tree *dst,
|
|
|
|
struct net_device *master)
|
|
|
|
{
|
|
|
|
struct dsa_notifier_master_state_info info;
|
|
|
|
struct dsa_port *cpu_dp = master->dsa_ptr;
|
|
|
|
|
|
|
|
info.master = master;
|
|
|
|
info.operational = dsa_port_master_is_operational(cpu_dp);
|
|
|
|
|
|
|
|
dsa_tree_notify(dst, DSA_NOTIFIER_MASTER_STATE_CHANGE, &info);
|
|
|
|
}
|
|
|
|
|
|
|
|
void dsa_tree_master_admin_state_change(struct dsa_switch_tree *dst,
|
|
|
|
struct net_device *master,
|
|
|
|
bool up)
|
|
|
|
{
|
|
|
|
struct dsa_port *cpu_dp = master->dsa_ptr;
|
|
|
|
bool notify = false;
|
|
|
|
|
|
|
|
if ((dsa_port_master_is_operational(cpu_dp)) !=
|
|
|
|
(up && cpu_dp->master_oper_up))
|
|
|
|
notify = true;
|
|
|
|
|
|
|
|
cpu_dp->master_admin_up = up;
|
|
|
|
|
|
|
|
if (notify)
|
|
|
|
dsa_tree_master_state_change(dst, master);
|
|
|
|
}
|
|
|
|
|
|
|
|
void dsa_tree_master_oper_state_change(struct dsa_switch_tree *dst,
|
|
|
|
struct net_device *master,
|
|
|
|
bool up)
|
|
|
|
{
|
|
|
|
struct dsa_port *cpu_dp = master->dsa_ptr;
|
|
|
|
bool notify = false;
|
|
|
|
|
|
|
|
if ((dsa_port_master_is_operational(cpu_dp)) !=
|
|
|
|
(cpu_dp->master_admin_up && up))
|
|
|
|
notify = true;
|
|
|
|
|
|
|
|
cpu_dp->master_oper_up = up;
|
|
|
|
|
|
|
|
if (notify)
|
|
|
|
dsa_tree_master_state_change(dst, master);
|
|
|
|
}
|
|
|
|
|
2019-10-22 04:51:16 +08:00
|
|
|
static struct dsa_port *dsa_port_touch(struct dsa_switch *ds, int index)
|
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst = ds->dst;
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_port(dp, ds)
|
|
|
|
if (dp->index == index)
|
2019-10-22 04:51:29 +08:00
|
|
|
return dp;
|
|
|
|
|
|
|
|
dp = kzalloc(sizeof(*dp), GFP_KERNEL);
|
|
|
|
if (!dp)
|
|
|
|
return NULL;
|
2019-10-22 04:51:16 +08:00
|
|
|
|
|
|
|
dp->ds = ds;
|
|
|
|
dp->index = index;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dp->list);
|
|
|
|
list_add_tail(&dp->list, &dst->ports);
|
|
|
|
|
|
|
|
return dp;
|
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
static int dsa_port_parse_user(struct dsa_port *dp, const char *name)
|
|
|
|
{
|
|
|
|
if (!name)
|
|
|
|
name = "eth%d";
|
|
|
|
|
|
|
|
dp->type = DSA_PORT_TYPE_USER;
|
|
|
|
dp->name = name;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dsa_port_parse_dsa(struct dsa_port *dp)
|
|
|
|
{
|
|
|
|
dp->type = DSA_PORT_TYPE_DSA;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-01-08 13:06:05 +08:00
|
|
|
static enum dsa_tag_protocol dsa_get_tag_protocol(struct dsa_port *dp,
|
|
|
|
struct net_device *master)
|
|
|
|
{
|
|
|
|
enum dsa_tag_protocol tag_protocol = DSA_TAG_PROTO_NONE;
|
|
|
|
struct dsa_switch *mds, *ds = dp->ds;
|
|
|
|
unsigned int mdp_upstream;
|
|
|
|
struct dsa_port *mdp;
|
|
|
|
|
|
|
|
/* It is possible to stack DSA switches onto one another when that
|
|
|
|
* happens the switch driver may want to know if its tagging protocol
|
|
|
|
* is going to work in such a configuration.
|
|
|
|
*/
|
|
|
|
if (dsa_slave_dev_check(master)) {
|
|
|
|
mdp = dsa_slave_to_port(master);
|
|
|
|
mds = mdp->ds;
|
|
|
|
mdp_upstream = dsa_upstream_port(mds, mdp->index);
|
|
|
|
tag_protocol = mds->ops->get_tag_protocol(mds, mdp_upstream,
|
|
|
|
DSA_TAG_PROTO_NONE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If the master device is not itself a DSA slave in a disjoint DSA
|
|
|
|
* tree, then return immediately.
|
|
|
|
*/
|
|
|
|
return ds->ops->get_tag_protocol(ds, dp->index, tag_protocol);
|
|
|
|
}
|
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
static int dsa_port_parse_cpu(struct dsa_port *dp, struct net_device *master,
|
|
|
|
const char *user_protocol)
|
2017-11-04 07:05:29 +08:00
|
|
|
{
|
2017-11-04 07:05:30 +08:00
|
|
|
struct dsa_switch *ds = dp->ds;
|
|
|
|
struct dsa_switch_tree *dst = ds->dst;
|
2021-03-23 04:26:50 +08:00
|
|
|
const struct dsa_device_ops *tag_ops;
|
2021-04-21 02:53:10 +08:00
|
|
|
enum dsa_tag_protocol default_proto;
|
|
|
|
|
|
|
|
/* Find out which protocol the switch would prefer. */
|
|
|
|
default_proto = dsa_get_tag_protocol(dp, master);
|
|
|
|
if (dst->default_proto) {
|
|
|
|
if (dst->default_proto != default_proto) {
|
|
|
|
dev_err(ds->dev,
|
|
|
|
"A DSA switch tree can have only one tagging protocol\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
dst->default_proto = default_proto;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* See if the user wants to override that preference. */
|
|
|
|
if (user_protocol) {
|
|
|
|
if (!ds->ops->change_tag_protocol) {
|
|
|
|
dev_err(ds->dev, "Tag protocol cannot be modified\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
tag_ops = dsa_find_tagger_by_name(user_protocol);
|
|
|
|
} else {
|
|
|
|
tag_ops = dsa_tag_driver_get(default_proto);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IS_ERR(tag_ops)) {
|
|
|
|
if (PTR_ERR(tag_ops) == -ENOPROTOOPT)
|
|
|
|
return -EPROBE_DEFER;
|
|
|
|
|
|
|
|
dev_warn(ds->dev, "No tagger for this switch\n");
|
|
|
|
return PTR_ERR(tag_ops);
|
|
|
|
}
|
2017-11-04 07:05:30 +08:00
|
|
|
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
if (dst->tag_ops) {
|
2021-04-21 02:53:10 +08:00
|
|
|
if (dst->tag_ops != tag_ops) {
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
dev_err(ds->dev,
|
|
|
|
"A DSA switch tree can have only one tagging protocol\n");
|
2021-04-21 02:53:10 +08:00
|
|
|
|
|
|
|
dsa_tag_driver_put(tag_ops);
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2021-04-21 02:53:10 +08:00
|
|
|
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
/* In the case of multiple CPU ports per switch, the tagging
|
2021-04-21 02:53:10 +08:00
|
|
|
* protocol is still reference-counted only per switch tree.
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
*/
|
2021-04-21 02:53:10 +08:00
|
|
|
dsa_tag_driver_put(tag_ops);
|
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:05 +08:00
|
|
|
} else {
|
2021-03-23 04:26:50 +08:00
|
|
|
dst->tag_ops = tag_ops;
|
2017-11-04 07:05:30 +08:00
|
|
|
}
|
|
|
|
|
2020-01-08 13:06:05 +08:00
|
|
|
dp->master = master;
|
2017-11-04 07:05:29 +08:00
|
|
|
dp->type = DSA_PORT_TYPE_CPU;
|
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 09:00:06 +08:00
|
|
|
dsa_port_set_tag_protocol(dp, dst->tag_ops);
|
2017-11-04 07:05:30 +08:00
|
|
|
dp->dst = dst;
|
2017-11-04 07:05:29 +08:00
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
/* At this point, the tree may be configured to use a different
|
|
|
|
* tagger than the one chosen by the switch driver during
|
|
|
|
* .setup, in the case when a user selects a custom protocol
|
|
|
|
* through the DT.
|
|
|
|
*
|
|
|
|
* This is resolved by syncing the driver with the tree in
|
|
|
|
* dsa_switch_setup_tag_protocol once .setup has run and the
|
|
|
|
* driver is ready to accept calls to .change_tag_protocol. If
|
|
|
|
* the driver does not support the custom protocol at that
|
|
|
|
* point, the tree is wholly rejected, thereby ensuring that the
|
|
|
|
* tree and driver are always in agreement on the protocol to
|
|
|
|
* use.
|
|
|
|
*/
|
2017-11-04 07:05:29 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-10-28 03:55:14 +08:00
|
|
|
static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
|
|
|
|
{
|
2017-10-28 03:55:15 +08:00
|
|
|
struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0);
|
2017-10-28 03:55:18 +08:00
|
|
|
const char *name = of_get_property(dn, "label", NULL);
|
2017-11-04 07:05:28 +08:00
|
|
|
bool link = of_property_read_bool(dn, "link");
|
2017-10-28 03:55:15 +08:00
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
dp->dn = dn;
|
|
|
|
|
2017-10-28 03:55:15 +08:00
|
|
|
if (ethernet) {
|
2017-10-28 03:55:17 +08:00
|
|
|
struct net_device *master;
|
2021-04-21 02:53:10 +08:00
|
|
|
const char *user_protocol;
|
2017-10-28 03:55:17 +08:00
|
|
|
|
|
|
|
master = of_find_net_device_by_node(ethernet);
|
|
|
|
if (!master)
|
|
|
|
return -EPROBE_DEFER;
|
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
user_protocol = of_get_property(dn, "dsa-tag-protocol", NULL);
|
|
|
|
return dsa_port_parse_cpu(dp, master, user_protocol);
|
2017-10-28 03:55:15 +08:00
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
if (link)
|
|
|
|
return dsa_port_parse_dsa(dp);
|
2017-10-28 03:55:14 +08:00
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
return dsa_port_parse_user(dp, name);
|
2017-10-28 03:55:14 +08:00
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
static int dsa_switch_parse_ports_of(struct dsa_switch *ds,
|
|
|
|
struct device_node *dn)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2017-10-28 03:55:13 +08:00
|
|
|
struct device_node *ports, *port;
|
2017-10-28 03:55:14 +08:00
|
|
|
struct dsa_port *dp;
|
2019-02-25 15:22:19 +08:00
|
|
|
int err = 0;
|
2016-06-05 03:17:07 +08:00
|
|
|
u32 reg;
|
2017-10-28 03:55:13 +08:00
|
|
|
|
|
|
|
ports = of_get_child_by_name(dn, "ports");
|
|
|
|
if (!ports) {
|
2020-07-20 20:49:39 +08:00
|
|
|
/* The second possibility is "ethernet-ports" */
|
|
|
|
ports = of_get_child_by_name(dn, "ethernet-ports");
|
|
|
|
if (!ports) {
|
|
|
|
dev_err(ds->dev, "no ports child node found\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2017-10-28 03:55:13 +08:00
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
|
|
|
|
for_each_available_child_of_node(ports, port) {
|
|
|
|
err = of_property_read_u32(port, "reg", ®);
|
2021-10-19 03:59:00 +08:00
|
|
|
if (err) {
|
|
|
|
of_node_put(port);
|
2019-02-25 15:22:19 +08:00
|
|
|
goto out_put_node;
|
2021-10-19 03:59:00 +08:00
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-02-25 15:22:19 +08:00
|
|
|
if (reg >= ds->num_ports) {
|
net: dsa: make dsa_switch :: num_ports an unsigned int
Currently, num_ports is declared as size_t, which is defined as
__kernel_ulong_t, therefore it occupies 8 bytes of memory.
Even switches with port numbers in the range of tens are exotic, so
there is no need for this amount of storage.
Additionally, because the max_num_bridges member right above it is also
4 bytes, it means the compiler needs to add padding between the last 2
fields. By reducing the size, we don't need that padding and can reduce
the struct size.
Before:
pahole -C dsa_switch net/dsa/slave.o
struct dsa_switch {
struct device * dev; /* 0 8 */
struct dsa_switch_tree * dst; /* 8 8 */
unsigned int index; /* 16 4 */
u32 setup:1; /* 20: 0 4 */
u32 vlan_filtering_is_global:1; /* 20: 1 4 */
u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */
u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */
u32 untag_bridge_pvid:1; /* 20: 4 4 */
u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */
u32 vlan_filtering:1; /* 20: 6 4 */
u32 pcs_poll:1; /* 20: 7 4 */
u32 mtu_enforcement_ingress:1; /* 20: 8 4 */
/* XXX 23 bits hole, try to pack */
struct notifier_block nb; /* 24 24 */
/* XXX last struct has 4 bytes of padding */
void * priv; /* 48 8 */
void * tagger_data; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dsa_chip_data * cd; /* 64 8 */
const struct dsa_switch_ops * ops; /* 72 8 */
u32 phys_mii_mask; /* 80 4 */
/* XXX 4 bytes hole, try to pack */
struct mii_bus * slave_mii_bus; /* 88 8 */
unsigned int ageing_time_min; /* 96 4 */
unsigned int ageing_time_max; /* 100 4 */
struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */
struct devlink * devlink; /* 112 8 */
unsigned int num_tx_queues; /* 120 4 */
unsigned int num_lag_ids; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
unsigned int max_num_bridges; /* 128 4 */
/* XXX 4 bytes hole, try to pack */
size_t num_ports; /* 136 8 */
/* size: 144, cachelines: 3, members: 27 */
/* sum members: 132, holes: 2, sum holes: 8 */
/* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 16 bytes */
};
After:
pahole -C dsa_switch net/dsa/slave.o
struct dsa_switch {
struct device * dev; /* 0 8 */
struct dsa_switch_tree * dst; /* 8 8 */
unsigned int index; /* 16 4 */
u32 setup:1; /* 20: 0 4 */
u32 vlan_filtering_is_global:1; /* 20: 1 4 */
u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */
u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */
u32 untag_bridge_pvid:1; /* 20: 4 4 */
u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */
u32 vlan_filtering:1; /* 20: 6 4 */
u32 pcs_poll:1; /* 20: 7 4 */
u32 mtu_enforcement_ingress:1; /* 20: 8 4 */
/* XXX 23 bits hole, try to pack */
struct notifier_block nb; /* 24 24 */
/* XXX last struct has 4 bytes of padding */
void * priv; /* 48 8 */
void * tagger_data; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dsa_chip_data * cd; /* 64 8 */
const struct dsa_switch_ops * ops; /* 72 8 */
u32 phys_mii_mask; /* 80 4 */
/* XXX 4 bytes hole, try to pack */
struct mii_bus * slave_mii_bus; /* 88 8 */
unsigned int ageing_time_min; /* 96 4 */
unsigned int ageing_time_max; /* 100 4 */
struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */
struct devlink * devlink; /* 112 8 */
unsigned int num_tx_queues; /* 120 4 */
unsigned int num_lag_ids; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
unsigned int max_num_bridges; /* 128 4 */
unsigned int num_ports; /* 132 4 */
/* size: 136, cachelines: 3, members: 27 */
/* sum members: 128, holes: 1, sum holes: 4 */
/* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 8 bytes */
};
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05 21:21:39 +08:00
|
|
|
dev_err(ds->dev, "port %pOF index %u exceeds num_ports (%u)\n",
|
2021-01-06 17:09:15 +08:00
|
|
|
port, reg, ds->num_ports);
|
2021-10-19 03:59:00 +08:00
|
|
|
of_node_put(port);
|
2019-02-25 15:22:19 +08:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out_put_node;
|
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-22 04:51:15 +08:00
|
|
|
dp = dsa_to_port(ds, reg);
|
2017-10-28 03:55:14 +08:00
|
|
|
|
|
|
|
err = dsa_port_parse_of(dp, port);
|
2021-10-19 03:59:00 +08:00
|
|
|
if (err) {
|
|
|
|
of_node_put(port);
|
2019-02-25 15:22:19 +08:00
|
|
|
goto out_put_node;
|
2021-10-19 03:59:00 +08:00
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2019-02-25 15:22:19 +08:00
|
|
|
out_put_node:
|
|
|
|
of_node_put(ports);
|
|
|
|
return err;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
static int dsa_switch_parse_member_of(struct dsa_switch *ds,
|
|
|
|
struct device_node *dn)
|
|
|
|
{
|
|
|
|
u32 m[2] = { 0, 0 };
|
|
|
|
int sz;
|
|
|
|
|
|
|
|
/* Don't error out if this optional property isn't found */
|
|
|
|
sz = of_property_read_variable_u32_array(dn, "dsa,member", m, 2, 2);
|
|
|
|
if (sz < 0 && sz != -EINVAL)
|
|
|
|
return sz;
|
|
|
|
|
|
|
|
ds->index = m[1];
|
|
|
|
|
|
|
|
ds->dst = dsa_tree_touch(m[0]);
|
|
|
|
if (!ds->dst)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2021-06-22 00:42:14 +08:00
|
|
|
if (dsa_switch_find(ds->dst->index, ds->index)) {
|
|
|
|
dev_err(ds->dev,
|
|
|
|
"A DSA switch with index %d already exists in tree %d\n",
|
|
|
|
ds->index, ds->dst->index);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
|
2021-07-22 23:55:39 +08:00
|
|
|
if (ds->dst->last_switch < ds->index)
|
|
|
|
ds->dst->last_switch = ds->index;
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-10-22 04:51:16 +08:00
|
|
|
static int dsa_switch_touch_ports(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp;
|
|
|
|
int port;
|
|
|
|
|
|
|
|
for (port = 0; port < ds->num_ports; port++) {
|
|
|
|
dp = dsa_port_touch(ds, port);
|
|
|
|
if (!dp)
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
static int dsa_switch_parse_of(struct dsa_switch *ds, struct device_node *dn)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = dsa_switch_parse_member_of(ds, dn);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2019-10-22 04:51:16 +08:00
|
|
|
err = dsa_switch_touch_ports(ds);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
return dsa_switch_parse_ports_of(ds, dn);
|
|
|
|
}
|
|
|
|
|
2017-10-28 03:55:14 +08:00
|
|
|
static int dsa_port_parse(struct dsa_port *dp, const char *name,
|
|
|
|
struct device *dev)
|
|
|
|
{
|
2017-10-28 03:55:15 +08:00
|
|
|
if (!strcmp(name, "cpu")) {
|
2017-10-28 03:55:17 +08:00
|
|
|
struct net_device *master;
|
|
|
|
|
|
|
|
master = dsa_dev_to_net_device(dev);
|
|
|
|
if (!master)
|
|
|
|
return -EPROBE_DEFER;
|
|
|
|
|
|
|
|
dev_put(master);
|
|
|
|
|
2021-04-21 02:53:10 +08:00
|
|
|
return dsa_port_parse_cpu(dp, master, NULL);
|
2017-10-28 03:55:15 +08:00
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
if (!strcmp(name, "dsa"))
|
|
|
|
return dsa_port_parse_dsa(dp);
|
2017-10-28 03:55:14 +08:00
|
|
|
|
2017-11-04 07:05:29 +08:00
|
|
|
return dsa_port_parse_user(dp, name);
|
2017-10-28 03:55:14 +08:00
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
static int dsa_switch_parse_ports(struct dsa_switch *ds,
|
|
|
|
struct dsa_chip_data *cd)
|
2017-02-05 05:02:43 +08:00
|
|
|
{
|
|
|
|
bool valid_name_found = false;
|
2017-10-28 03:55:14 +08:00
|
|
|
struct dsa_port *dp;
|
|
|
|
struct device *dev;
|
|
|
|
const char *name;
|
2017-02-05 05:02:43 +08:00
|
|
|
unsigned int i;
|
2017-10-28 03:55:14 +08:00
|
|
|
int err;
|
2017-02-05 05:02:43 +08:00
|
|
|
|
|
|
|
for (i = 0; i < DSA_MAX_PORTS; i++) {
|
2017-10-28 03:55:14 +08:00
|
|
|
name = cd->port_names[i];
|
|
|
|
dev = cd->netdev[i];
|
2019-10-22 04:51:15 +08:00
|
|
|
dp = dsa_to_port(ds, i);
|
2017-10-28 03:55:14 +08:00
|
|
|
|
|
|
|
if (!name)
|
2017-02-05 05:02:43 +08:00
|
|
|
continue;
|
|
|
|
|
2017-10-28 03:55:14 +08:00
|
|
|
err = dsa_port_parse(dp, name, dev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-02-05 05:02:43 +08:00
|
|
|
valid_name_found = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!valid_name_found && i == DSA_MAX_PORTS)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
static int dsa_switch_parse(struct dsa_switch *ds, struct dsa_chip_data *cd)
|
2017-02-05 05:02:43 +08:00
|
|
|
{
|
2019-10-22 04:51:16 +08:00
|
|
|
int err;
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
ds->cd = cd;
|
2017-02-05 05:02:43 +08:00
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
/* We don't support interconnected switches nor multiple trees via
|
|
|
|
* platform data, so this is the unique switch of the tree.
|
|
|
|
*/
|
|
|
|
ds->index = 0;
|
|
|
|
ds->dst = dsa_tree_touch(0);
|
|
|
|
if (!ds->dst)
|
|
|
|
return -ENOMEM;
|
2017-02-05 05:02:43 +08:00
|
|
|
|
2019-10-22 04:51:16 +08:00
|
|
|
err = dsa_switch_touch_ports(ds);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
return dsa_switch_parse_ports(ds, cd);
|
2017-02-05 05:02:43 +08:00
|
|
|
}
|
|
|
|
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
static void dsa_switch_release_ports(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
struct dsa_port *dp, *next;
|
|
|
|
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_port_safe(dp, next, ds) {
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
list_del(&dp->list);
|
|
|
|
kfree(dp);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-11-07 05:11:53 +08:00
|
|
|
static int dsa_switch_probe(struct dsa_switch *ds)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
2019-10-31 10:09:17 +08:00
|
|
|
struct dsa_switch_tree *dst;
|
2019-10-24 18:32:18 +08:00
|
|
|
struct dsa_chip_data *pdata;
|
|
|
|
struct device_node *np;
|
2017-11-07 05:11:51 +08:00
|
|
|
int err;
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2019-10-22 04:51:30 +08:00
|
|
|
if (!ds->dev)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2019-10-24 18:32:18 +08:00
|
|
|
pdata = ds->dev->platform_data;
|
|
|
|
np = ds->dev->of_node;
|
|
|
|
|
2019-10-22 04:51:30 +08:00
|
|
|
if (!ds->num_ports)
|
|
|
|
return -EINVAL;
|
|
|
|
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
if (np) {
|
2017-11-04 07:05:27 +08:00
|
|
|
err = dsa_switch_parse_of(ds, np);
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
if (err)
|
|
|
|
dsa_switch_release_ports(ds);
|
|
|
|
} else if (pdata) {
|
2017-11-04 07:05:27 +08:00
|
|
|
err = dsa_switch_parse(ds, pdata);
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
if (err)
|
|
|
|
dsa_switch_release_ports(ds);
|
|
|
|
} else {
|
2017-11-04 07:05:27 +08:00
|
|
|
err = -ENODEV;
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
}
|
2016-06-05 03:17:07 +08:00
|
|
|
|
2017-11-04 07:05:27 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
2016-07-07 08:03:54 +08:00
|
|
|
|
2019-10-31 10:09:17 +08:00
|
|
|
dst = ds->dst;
|
|
|
|
dsa_tree_get(dst);
|
|
|
|
err = dsa_tree_setup(dst);
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
if (err) {
|
|
|
|
dsa_switch_release_ports(ds);
|
2019-10-31 10:09:17 +08:00
|
|
|
dsa_tree_put(dst);
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
}
|
2019-10-31 10:09:17 +08:00
|
|
|
|
|
|
|
return err;
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
2017-05-27 06:12:51 +08:00
|
|
|
int dsa_register_switch(struct dsa_switch *ds)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
mutex_lock(&dsa2_mutex);
|
2017-11-07 05:11:53 +08:00
|
|
|
err = dsa_switch_probe(ds);
|
2017-11-25 00:36:06 +08:00
|
|
|
dsa_tree_put(ds->dst);
|
2016-06-05 03:17:07 +08:00
|
|
|
mutex_unlock(&dsa2_mutex);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dsa_register_switch);
|
|
|
|
|
2017-11-07 05:11:53 +08:00
|
|
|
static void dsa_switch_remove(struct dsa_switch *ds)
|
2016-06-05 03:17:07 +08:00
|
|
|
{
|
|
|
|
struct dsa_switch_tree *dst = ds->dst;
|
2019-10-22 04:51:29 +08:00
|
|
|
|
2019-11-03 11:13:26 +08:00
|
|
|
dsa_tree_teardown(dst);
|
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
[ 5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[ 6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[ 6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[ 7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 05:01:11 +08:00
|
|
|
dsa_switch_release_ports(ds);
|
2019-10-31 10:09:17 +08:00
|
|
|
dsa_tree_put(dst);
|
2016-06-05 03:17:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void dsa_unregister_switch(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
mutex_lock(&dsa2_mutex);
|
2017-11-07 05:11:53 +08:00
|
|
|
dsa_switch_remove(ds);
|
2016-06-05 03:17:07 +08:00
|
|
|
mutex_unlock(&dsa2_mutex);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dsa_unregister_switch);
|
net: dsa: be compatible with masters which unregister on shutdown
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 21:34:33 +08:00
|
|
|
|
|
|
|
/* If the DSA master chooses to unregister its net_device on .shutdown, DSA is
|
|
|
|
* blocking that operation from completion, due to the dev_hold taken inside
|
|
|
|
* netdev_upper_dev_link. Unlink the DSA slave interfaces from being uppers of
|
|
|
|
* the DSA master, so that the system can reboot successfully.
|
|
|
|
*/
|
|
|
|
void dsa_switch_shutdown(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
struct net_device *master, *slave_dev;
|
|
|
|
struct dsa_port *dp;
|
|
|
|
|
|
|
|
mutex_lock(&dsa2_mutex);
|
|
|
|
rtnl_lock();
|
|
|
|
|
2021-10-21 01:49:51 +08:00
|
|
|
dsa_switch_for_each_user_port(dp, ds) {
|
net: dsa: be compatible with masters which unregister on shutdown
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 21:34:33 +08:00
|
|
|
master = dp->cpu_dp->master;
|
|
|
|
slave_dev = dp->slave;
|
|
|
|
|
|
|
|
netdev_upper_dev_unlink(master, slave_dev);
|
|
|
|
}
|
net: dsa: fix panic when DSA master device unbinds on shutdown
Rafael reports that on a system with LX2160A and Marvell DSA switches,
if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
panic can be seen:
systemd-shutdown[1]: Rebooting.
Unable to handle kernel paging request at virtual address 00a0000800000041
[00a0000800000041] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
pc : dsa_slave_netdevice_event+0x130/0x3e4
lr : raw_notifier_call_chain+0x50/0x6c
Call trace:
dsa_slave_netdevice_event+0x130/0x3e4
raw_notifier_call_chain+0x50/0x6c
call_netdevice_notifiers_info+0x54/0xa0
__dev_close_many+0x50/0x130
dev_close_many+0x84/0x120
unregister_netdevice_many+0x130/0x710
unregister_netdevice_queue+0x8c/0xd0
unregister_netdev+0x20/0x30
dpaa2_eth_remove+0x68/0x190
fsl_mc_driver_remove+0x20/0x5c
__device_release_driver+0x21c/0x220
device_release_driver_internal+0xac/0xb0
device_links_unbind_consumers+0xd4/0x100
__device_release_driver+0x94/0x220
device_release_driver+0x28/0x40
bus_remove_device+0x118/0x124
device_del+0x174/0x420
fsl_mc_device_remove+0x24/0x40
__fsl_mc_device_remove+0xc/0x20
device_for_each_child+0x58/0xa0
dprc_remove+0x90/0xb0
fsl_mc_driver_remove+0x20/0x5c
__device_release_driver+0x21c/0x220
device_release_driver+0x28/0x40
bus_remove_device+0x118/0x124
device_del+0x174/0x420
fsl_mc_bus_remove+0x80/0x100
fsl_mc_bus_shutdown+0xc/0x1c
platform_shutdown+0x20/0x30
device_shutdown+0x154/0x330
__do_sys_reboot+0x1cc/0x250
__arm64_sys_reboot+0x20/0x30
invoke_syscall.constprop.0+0x4c/0xe0
do_el0_svc+0x4c/0x150
el0_svc+0x24/0xb0
el0t_64_sync_handler+0xa8/0xb0
el0t_64_sync+0x178/0x17c
It can be seen from the stack trace that the problem is that the
deregistration of the master causes a dev_close(), which gets notified
as NETDEV_GOING_DOWN to dsa_slave_netdevice_event().
But dsa_switch_shutdown() has already run, and this has unregistered the
DSA slave interfaces, and yet, the NETDEV_GOING_DOWN handler attempts to
call dev_close_many() on those slave interfaces, leading to the problem.
The previous attempt to avoid the NETDEV_GOING_DOWN on the master after
dsa_switch_shutdown() was called seems improper. Unregistering the slave
interfaces is unnecessary and unhelpful. Instead, after the slaves have
stopped being uppers of the DSA master, we can now reset to NULL the
master->dsa_ptr pointer, which will make DSA start ignoring all future
notifier events on the master.
Fixes: 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")
Reported-by: Rafael Richter <rafael.richter@gin.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09 20:04:33 +08:00
|
|
|
|
|
|
|
/* Disconnect from further netdevice notifiers on the master,
|
|
|
|
* since netdev_uses_dsa() will now return false.
|
|
|
|
*/
|
|
|
|
dsa_switch_for_each_cpu_port(dp, ds)
|
|
|
|
dp->master->dsa_ptr = NULL;
|
net: dsa: be compatible with masters which unregister on shutdown
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 21:34:33 +08:00
|
|
|
|
|
|
|
rtnl_unlock();
|
|
|
|
mutex_unlock(&dsa2_mutex);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dsa_switch_shutdown);
|