mirror of
https://github.com/edk2-porting/linux-next.git
synced 2024-12-27 14:43:58 +08:00
e2f01bfe14
1159 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Alvin Šipraga
|
43a4b4dbd4 |
net: dsa: fix spurious error message when unoffloaded port leaves bridge
Flip the sign of a return value check, thereby suppressing the following
spurious error:
port 2 failed to notify DSA_NOTIFIER_BRIDGE_LEAVE: -EOPNOTSUPP
... which is emitted when removing an unoffloaded DSA switch port from a
bridge.
Fixes:
|
||
Vladimir Oltean
|
1951b3f19c |
net: dsa: hold rtnl_lock in dsa_switch_setup_tag_protocol
It was a documented fact that ds->ops->change_tag_protocol() offered
rtnetlink mutex protection to the switch driver, since there was an
ASSERT_RTNL right before the call in dsa_switch_change_tag_proto()
(initiated from sysfs).
The blamed commit introduced another call path for
ds->ops->change_tag_protocol() which does not hold the rtnl_mutex.
This is:
dsa_tree_setup
-> dsa_tree_setup_switches
-> dsa_switch_setup
-> dsa_switch_setup_tag_protocol
-> ds->ops->change_tag_protocol()
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice(slave_dev)
-> dsa_tree_setup_master
-> dsa_master_setup
-> dev->dsa_ptr = cpu_dp
The reason why the rtnl_mutex is held in the sysfs call path is to
ensure that, once the master and all the DSA interfaces are down (which
is required so that no packets flow), they remain down during the
tagging protocol change.
The above calling order illustrates the fact that it should not be risky
to change the initial tagging protocol to the one specified in the
device tree at the given time:
- packets cannot enter the dsa_switch_rcv() packet type handler since
netdev_uses_dsa() for the master will not yet return true, since
dev->dsa_ptr has not yet been populated
- packets cannot enter the dsa_slave_xmit() function because no DSA
interface has yet been registered
So from the DSA core's perspective, holding the rtnl_mutex is indeed not
necessary.
Yet, drivers may need to do things which need rtnl_mutex protection. For
example:
felix_set_tag_protocol
-> felix_setup_tag_8021q
-> dsa_tag_8021q_register
-> dsa_tag_8021q_setup
-> dsa_tag_8021q_port_setup
-> vlan_vid_add
-> ASSERT_RTNL
These drivers do not really have a choice to take the rtnl_mutex
themselves, since in the sysfs case, the rtnl_mutex is already held.
Fixes:
|
||
Vladimir Oltean
|
5bded8259e |
net: dsa: mv88e6xxx: isolate the ATU databases of standalone and bridged ports
Similar to commit |
||
Vladimir Oltean
|
c7709a02c1 |
net: dsa: tag_dsa: send packets with TX fwd offload from VLAN-unaware bridges using VID 0
The present code is structured this way due to an incomplete thought
process. In Documentation/networking/switchdev.rst we document that if a
bridge is VLAN-unaware, then the presence or lack of a pvid on a bridge
port (or on the bridge itself, for that matter) should not affect the
ability to receive and transmit tagged or untagged packets.
If the bridge on behalf of which we are sending this packet is
VLAN-aware, then the TX forwarding offload API ensures that the skb will
be VLAN-tagged (if the packet was sent by user space as untagged, it
will get transmitted town to the driver as tagged with the bridge
device's pvid). But if the bridge is VLAN-unaware, it may or may not be
VLAN-tagged. In fact the logic to insert the bridge's PVID came from the
idea that we should emulate what is being done in the VLAN-aware case.
But we shouldn't.
It appears that injecting packets using a VLAN ID of 0 serves the
purpose of forwarding the packets to the egress port with no VLAN tag
added or stripped by the hardware, and no filtering being performed.
So we can simply remove the superfluous logic.
One reason why this logic is broken is that when CONFIG_BRIDGE_VLAN_FILTERING=n,
we call br_vlan_get_pvid_rcu() but that returns an error and we do error
out, dropping all packets on xmit. Not really smart. This is also an
issue when the user deletes the bridge pvid:
$ bridge vlan del dev br0 vid 1 self
As mentioned, in both cases, packets should still flow freely, and they
do just that on any net device where the bridge is not offloaded, but on
mv88e6xxx they don't.
Fixes:
|
||
Vladimir Oltean
|
1bec0f0506 |
net: dsa: fix bridge_num not getting cleared after ports leaving the bridge
The dp->bridge_num is zero-based, with -1 being the encoding for an
invalid value. But dsa_bridge_num_put used to check for an invalid value
by comparing bridge_num with 0, which is of course incorrect.
The result is that the bridge_num will never get cleared by
dsa_bridge_num_put, and further port joins to other bridges will get a
bridge_num larger than the previous one, and once all the available
bridges with TX forwarding offload supported by the hardware get
exhausted, the TX forwarding offload feature is simply disabled.
In the case of sja1105, 7 iterations of the loop below are enough to
exhaust the TX forwarding offload bits, and further bridge joins operate
without that feature.
ip link add br0 type bridge vlan_filtering 1
while :; do
ip link set sw0p2 master br0 && sleep 1
ip link set sw0p2 nomaster && sleep 1
done
This issue is enough of an indication that having the dp->bridge_num
invalid encoding be a negative number is prone to bugs, so this will be
changed to a one-based value, with the dp->bridge_num of zero being the
indication of no bridge. However, that is material for net-next.
Fixes:
|
||
Jakub Kicinski
|
9fe1155233 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Andrew Lunn
|
b44d52a50b |
dsa: tag_dsa: Fix mask for trunked packets
A packet received on a trunk will have bit 2 set in Forward DSA tagged
frame. Bit 1 can be either 0 or 1 and is otherwise undefined and bit 0
indicates the frame CFI. Masking with 7 thus results in frames as
being identified as being from a trunk when in fact they are not. Fix
the mask to just look at bit 2.
Fixes:
|
||
Jakub Kicinski
|
e35b8d7dbb |
net: use eth_hw_addr_set() instead of ether_addr_copy()
Convert from ether_addr_copy() to eth_hw_addr_set(): @@ expression dev, np; @@ - ether_addr_copy(dev->dev_addr, np) + eth_hw_addr_set(dev, np) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
5ca721c54d |
net: dsa: tag_ocelot: set the classified VLAN during xmit
Currently, all packets injected into Ocelot switches are classified to VLAN 0, regardless of whether they are VLAN-tagged or not. This is because the switch only looks at the VLAN TCI from the DSA tag. VLAN 0 is then stripped on egress due to REW_TAG_CFG_TAG_CFG. There are 2 cases really, below is the explanation for ocelot_port_set_native_vlan: - Port is VLAN-aware, we set REW_TAG_CFG_TAG_CFG to 1 (egress-tag all frames except VID 0 and the native VLAN) if a native VLAN exists, or to 3 otherwise (tag all frames, including VID 0). - Port is VLAN-unaware, we set REW_TAG_CFG_TAG_CFG to 0 (port tagging disabled, classified VLAN never appears in the packet). One can already see an inconsistency: when a native VLAN exists, VID 0 is egress-untagged, but when it doesn't, VID 0 is egress-tagged. So when we do this: ip link add br0 type bridge vlan_filtering 1 ip link set swp0 master br0 bridge vlan del dev swp0 vid 1 bridge vlan add dev swp0 vid 1 pvid # but not untagged and we ping through swp0, packets will look like this: MAC > 33:33:00:00:00:02, ethertype 802.1Q (0x8100): vlan 0, p 0, ethertype 802.1Q (0x8100), vlan 1, p 0, ethertype IPv6 (0x86dd), ICMP6, router solicitation, length 16 So VID 1 frames (sent that way by the Linux bridge) are encapsulated in a VID 0 header - the classified VLAN of the packets as far as the hw is concerned. To avoid that, what we really need to do is stop injecting packets using the classified VLAN of 0. This patch strips the VLAN header from the skb payload, if that VLAN exists and if the port is under a VLAN-aware bridge. Then it copies that VLAN header into the DSA injection frame header. A positive side effect is that VCAP ES0 VLAN rewriting rules now work for packets injected from the CPU into a port that's under a VLAN-aware bridge, and we are able to match those packets by the VLAN ID that was sent by the network stack, and not by VLAN ID 0. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Mianhan Liu
|
ca4b0649be |
net/dsa/tag_ksz.c: remove superfluous headers
tag_ksz.c hasn't use any macro or function declared in linux/slab.h. Thus, these files can be removed from tag_ksz.c safely without affecting the compilation of the ./net/dsa module Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Mianhan Liu
|
6f8b64f86e |
net/dsa/tag_8021q.c: remove superfluous headers
tag_8021q.c hasn't use any macro or function declared in linux/if_bridge.h. Thus, these files can be removed from tag_8021q.c safely without affecting the compilation of the ./net/dsa module Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Leon Romanovsky
|
bd936bd53b |
net: dsa: Move devlink registration to be last devlink command
This change prevents from users to access device before devlink is fully configured. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Jakub Kicinski
|
2fcd14d0f7 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/mptcp/protocol.c |
||
Vladimir Oltean
|
f5aef42415 |
net: dsa: sja1105: break dependency between dsa_port_is_sja1105 and switch driver
It's nice to be able to test a tagging protocol with dsa_loop, but not
at the cost of losing the ability of building the tagging protocol and
switch driver as modules, because as things stand, there is a circular
dependency between the two. Tagging protocol drivers cannot depend on
switch drivers, that is a hard fact.
The reasoning behind the blamed patch was that accessing dp->priv should
first make sure that the structure behind that pointer is what we really
think it is.
Currently the "sja1105" and "sja1110" tagging protocols only operate
with the sja1105 switch driver, just like any other tagging protocol and
switch combination. The only way to mix and match them is by modifying
the code, and this applies to dsa_loop as well (by default that uses
DSA_TAG_PROTO_NONE). So while in principle there is an issue, in
practice there isn't one.
Until we extend dsa_loop to allow user space configuration, treat the
problem as a non-issue and just say that DSA ports found by tag_sja1105
are always sja1105 ports, which is in fact true. But keep the
dsa_port_is_sja1105 function so that it's easy to patch it during
testing, and rely on dead code elimination.
Fixes:
|
||
Vladimir Oltean
|
6d709cadfd |
net: dsa: move sja1110_process_meta_tstamp inside the tagging protocol driver
The problem is that DSA tagging protocols really must not depend on the
switch driver, because this creates a circular dependency at insmod
time, and the switch driver will effectively not load when the tagging
protocol driver is missing.
The code was structured in the way it was for a reason, though. The DSA
driver-facing API for PTP timestamping relies on the assumption that
two-step TX timestamps are provided by the hardware in an out-of-band
manner, typically by raising an interrupt and making that timestamp
available inside some sort of FIFO which is to be accessed over
SPI/MDIO/etc.
So the API puts .port_txtstamp into dsa_switch_ops, because it is
expected that the switch driver needs to save some state (like put the
skb into a queue until its TX timestamp arrives).
On SJA1110, TX timestamps are provided by the switch as Ethernet
packets, so this makes them be received and processed by the tagging
protocol driver. This in itself is great, because the timestamps are
full 64-bit and do not require reconstruction, and since Ethernet is the
fastest I/O method available to/from the switch, PTP timestamps arrive
very quickly, no matter how bottlenecked the SPI connection is, because
SPI interaction is not needed at all.
DSA's code structure and strict isolation between the tagging protocol
driver and the switch driver break the natural code organization.
When the tagging protocol driver receives a packet which is classified
as a metadata packet containing timestamps, it passes those timestamps
one by one to the switch driver, which then proceeds to compare them
based on the recorded timestamp ID that was generated in .port_txtstamp.
The communication between the tagging protocol and the switch driver is
done through a method exported by the switch driver, sja1110_process_meta_tstamp.
To satisfy build requirements, we force a dependency to build the
tagging protocol driver as a module when the switch driver is a module.
However, as explained in the first paragraph, that causes the circular
dependency.
To solve this, move the skb queue from struct sja1105_private :: struct
sja1105_ptp_data to struct sja1105_private :: struct sja1105_tagger_data.
The latter is a data structure for which hacks have already been put
into place to be able to create persistent storage per switch that is
accessible from the tagging protocol driver (see sja1105_setup_ports).
With the skb queue directly accessible from the tagging protocol driver,
we can now move sja1110_process_meta_tstamp into the tagging driver
itself, and avoid exporting a symbol.
Fixes:
|
||
Leon Romanovsky
|
db4278c55f |
devlink: Make devlink_register to be void
devlink_register() can't fail and always returns success, but all drivers are obligated to check returned status anyway. This adds a lot of boilerplate code to handle impossible flow. Make devlink_register() void and simplify the drivers that use that API call. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Simon Horman <simon.horman@corigine.com> Acked-by: Vladimir Oltean <olteanv@gmail.com> # dsa Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
5135e96a3d |
net: dsa: don't allocate the slave_mii_bus using devres
The Linux device model permits both the ->shutdown and ->remove driver
methods to get called during a shutdown procedure. Example: a DSA switch
which sits on an SPI bus, and the SPI bus driver calls this on its
->shutdown method:
spi_unregister_controller
-> device_for_each_child(&ctlr->dev, NULL, __unregister);
-> spi_unregister_device(to_spi_device(dev));
-> device_del(&spi->dev);
So this is a simple pattern which can theoretically appear on any bus,
although the only other buses on which I've been able to find it are
I2C:
i2c_del_adapter
-> device_for_each_child(&adap->dev, NULL, __unregister_client);
-> i2c_unregister_device(client);
-> device_unregister(&client->dev);
The implication of this pattern is that devices on these buses can be
unregistered after having been shut down. The drivers for these devices
might choose to return early either from ->remove or ->shutdown if the
other callback has already run once, and they might choose that the
->shutdown method should only perform a subset of the teardown done by
->remove (to avoid unnecessary delays when rebooting).
So in other words, the device driver may choose on ->remove to not
do anything (therefore to not unregister an MDIO bus it has registered
on ->probe), because this ->remove is actually triggered by the
device_shutdown path, and its ->shutdown method has already run and done
the minimally required cleanup.
This used to be fine until the blamed commit, but now, the following
BUG_ON triggers:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
In other words, there is an attempt to free an MDIO bus which was not
unregistered. The attempt to free it comes from the devres release
callbacks of the SPI device, which are executed after the device is
unregistered.
I'm not saying that the fact that MDIO buses allocated using devres
would automatically get unregistered wasn't strange. I'm just saying
that the commit didn't care about auditing existing call paths in the
kernel, and now, the following code sequences are potentially buggy:
(a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
located on a bus that unregisters its children on shutdown. After
the blamed patch, either both the alloc and the register should use
devres, or none should.
(b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
mdiobus_unregister at all in the remove path. After the blamed
patch, nobody unregisters the MDIO bus anymore, so this is even more
buggy than the previous case which needs a specific bus
configuration to be seen, this one is an unconditional bug.
In this case, DSA falls into category (a), it tries to be helpful and
registers an MDIO bus on behalf of the switch, which might be on such a
bus. I've no idea why it does it under devres.
It does this on probe:
if (!ds->slave_mii_bus && ds->ops->phy_read)
alloc and register mdio bus
and this on remove:
if (ds->slave_mii_bus && ds->ops->phy_read)
unregister mdio bus
I _could_ imagine using devres because the condition used on remove is
different than the condition used on probe. So strictly speaking, DSA
cannot determine whether the ds->slave_mii_bus it sees on remove is the
ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
have solved that problem. But nonetheless, the existing code already
proceeds to unregister the MDIO bus, even though it might be
unregistering an MDIO bus it has never registered. So I can only guess
that no driver that implements ds->ops->phy_read also allocates and
registers ds->slave_mii_bus itself.
So in that case, if unregistering is fine, freeing must be fine too.
Stop using devres and free the MDIO bus manually. This will make devres
stop attempting to free a still registered MDIO bus on ->shutdown.
Fixes:
|
||
Vladimir Oltean
|
e5845aa0ea |
net: dsa: fix dsa_tree_setup error path
Since the blamed commit, dsa_tree_teardown_switches() was split into two
smaller functions, dsa_tree_teardown_switches and dsa_tree_teardown_ports.
However, the error path of dsa_tree_setup stopped calling dsa_tree_teardown_ports.
Fixes:
|
||
Vladimir Oltean
|
fd292c189a |
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit |
||
Vladimir Oltean
|
0650bf52b3 |
net: dsa: be compatible with masters which unregister on shutdown
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes:
|
||
Vladimir Oltean
|
3c9cfb5269 |
net: update NXP copyright text
NXP Legal insists that the following are not fine: - Saying "NXP Semiconductors" instead of "NXP", since the company's registered name is "NXP" - Putting a "(c)" sign in the copyright string - Putting a comma in the copyright string The only accepted copyright string format is "Copyright <year-range> NXP". This patch changes the copyright headers in the networking files that were sent by me, or derived from code sent by me. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Jakub Kicinski
|
561bed688b |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts! Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Vladimir Oltean
|
a57d8c217a |
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes:
|
||
Vladimir Oltean
|
6a52e73368 |
net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup
DSA supports connecting to a phy-handle, and has a fallback to a non-OF
based method of connecting to an internal PHY on the switch's own MDIO
bus, if no phy-handle and no fixed-link nodes were present.
The -ENODEV error code from the first attempt (phylink_of_phy_connect)
is what triggers the second attempt (phylink_connect_phy).
However, when the first attempt returns a different error code than
-ENODEV, this results in an unbalance of calls to phylink_create and
phylink_destroy by the time we exit the function. The phylink instance
has leaked.
There are many other error codes that can be returned by
phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
So this is a practical issue too.
Fixes:
|
||
Linus Walleij
|
339133f6c3 |
net: dsa: tag_rtl4_a: Drop bit 9 from egress frames
This drops the code setting bit 9 on egress frames on the Realtek "type A" (RTL8366RB) frames. This bit was set on ingress frames for unknown reason, and was set on egress frames as the format of ingress and egress frames was believed to be the same. As that assumption turned out to be false, and since this bit seems to have zero effect on the behaviour of the switch let's drop this bit entirely. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210913143156.1264570-1-linus.walleij@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Linus Walleij
|
0e90dfa7a8 |
net: dsa: tag_rtl4_a: Fix egress tags
I noticed that only port 0 worked on the RTL8366RB since we
started to use custom tags.
It turns out that the format of egress custom tags is actually
different from ingress custom tags. While the lower bits just
contain the port number in ingress tags, egress tags need to
indicate destination port by setting the bit for the
corresponding port.
It was working on port 0 because port 0 added 0x00 as port
number in the lower bits, and if you do this the packet appears
at all ports, including the intended port. Ooops.
Fix this and all ports work again. Use the define for shifting
the "type A" into place while we're at it.
Tested on the D-Link DIR-685 by sending traffic to each of
the ports in turn. It works.
Fixes:
|
||
Vladimir Oltean
|
8ded916092 |
net: dsa: tag_sja1105: stop asking the sja1105 driver in sja1105_xmit_tpid
Introduced in commit
|
||
Vladimir Oltean
|
b0b8c67eaa |
net: dsa: sja1105: drop untagged packets on the CPU and DSA ports
The sja1105 driver is a bit special in its use of VLAN headers as DSA tags. This is because in VLAN-aware mode, the VLAN headers use an actual TPID of 0x8100, which is understood even by the DSA master as an actual VLAN header. Furthermore, control packets such as PTP and STP are transmitted with no VLAN header as a DSA tag, because, depending on switch generation, there are ways to steer these control packets towards a precise egress port other than VLAN tags. Transmitting control packets as untagged means leaving a door open for traffic in general to be transmitted as untagged from the DSA master, and for it to traverse the switch and exit a random switch port according to the FDB lookup. This behavior is a bit out of line with other DSA drivers which have native support for DSA tagging. There, it is to be expected that the switch only accepts DSA-tagged packets on its CPU port, dropping everything that does not match this pattern. We perhaps rely a bit too much on the switches' hardware dropping on the CPU port, and place no other restrictions in the kernel data path to avoid that. For example, sja1105 is also a bit special in that STP/PTP packets are transmitted using "management routes" (sja1105_port_deferred_xmit): when sending a link-local packet from the CPU, we must first write a SPI message to the switch to tell it to expect a packet towards multicast MAC DA 01-80-c2-00-00-0e, and to route it towards port 3 when it gets it. This entry expires as soon as it matches a packet received by the switch, and it needs to be reinstalled for the next packet etc. All in all quite a ghetto mechanism, but it is all that the sja1105 switches offer for injecting a control packet. The driver takes a mutex for serializing control packets and making the pairs of SPI writes of a management route and its associated skb atomic, but to be honest, a mutex is only relevant as long as all parties agree to take it. With the DSA design, it is possible to open an AF_PACKET socket on the DSA master net device, and blast packets towards 01-80-c2-00-00-0e, and whatever locking the DSA switch driver might use, it all goes kaput because management routes installed by the driver will match skbs sent by the DSA master, and not skbs generated by the driver itself. So they will end up being routed on the wrong port. So through the lens of that, maybe it would make sense to avoid that from happening by doing something in the network stack, like: introduce a new bit in struct sk_buff, like xmit_from_dsa. Then, somewhere around dev_hard_start_xmit(), introduce the following check: if (netdev_uses_dsa(dev) && !skb->xmit_from_dsa) kfree_skb(skb); Ok, maybe that is a bit drastic, but that would at least prevent a bunch of problems. For example, right now, even though the majority of DSA switches drop packets without DSA tags sent by the DSA master (and therefore the majority of garbage that user space daemons like avahi and udhcpcd and friends create), it is still conceivable that an aggressive user space program can open an AF_PACKET socket and inject a spoofed DSA tag directly on the DSA master. We have no protection against that; the packet will be understood by the switch and be routed wherever user space says. Furthermore: there are some DSA switches where we even have register access over Ethernet, using DSA tags. So even user space drivers are possible in this way. This is a huge hole. However, the biggest thing that bothers me is that udhcpcd attempts to ask for an IP address on all interfaces by default, and with sja1105, it will attempt to get a valid IP address on both the DSA master as well as on sja1105 switch ports themselves. So with IP addresses in the same subnet on multiple interfaces, the routing table will be messed up and the system will be unusable for traffic until it is configured manually to not ask for an IP address on the DSA master itself. It turns out that it is possible to avoid that in the sja1105 driver, at least very superficially, by requesting the switch to drop VLAN-untagged packets on the CPU port. With the exception of control packets, all traffic originated from tag_sja1105.c is already VLAN-tagged, so only STP and PTP packets need to be converted. For that, we need to uphold the equivalence between an untagged and a pvid-tagged packet, and to remember that the CPU port of sja1105 uses a pvid of 4095. Now that we drop untagged traffic on the CPU port, non-aggressive user space applications like udhcpcd stop bothering us, and sja1105 effectively becomes just as vulnerable to the aggressive kind of user space programs as other DSA switches are (ok, users can also create 8021q uppers on top of the DSA master in the case of sja1105, but in future patches we can easily deny that, but it still doesn't change the fact that VLAN-tagged packets can still be injected over raw sockets). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
58adf9dcb1 |
net: dsa: let drivers state that they need VLAN filtering while standalone
As explained in commit
|
||
Vladimir Oltean
|
06cfb2df7e |
net: dsa: don't advertise 'rx-vlan-filter' when not needed
There have been multiple independent reports about dsa_slave_vlan_rx_add_vid being called (and consequently calling the drivers' .port_vlan_add) when it isn't needed, and sometimes (not always) causing problems in the process. Case 1: mv88e6xxx_port_vlan_prepare is stubborn and only accepts VLANs on bridged ports. That is understandably so, because standalone mv88e6xxx ports are VLAN-unaware, and VTU entries are said to be a scarce resource. Otherwise said, the following fails lamentably on mv88e6xxx: ip link add br0 type bridge vlan_filtering 1 ip link set lan3 master br0 ip link add link lan10 name lan10.1 type vlan id 1 [485256.724147] mv88e6085 d0032004.mdio-mii:12: p10: hw VLAN 1 already used by port 3 in br0 RTNETLINK answers: Operation not supported This has become a worse issue since commit |
||
Vladimir Oltean
|
67b5fb5db7 |
net: dsa: properly fall back to software bridging
If the driver does not implement .port_bridge_{join,leave}, then we must fall back to standalone operation on that port, and trigger the error path of dsa_port_bridge_join. This sets dp->bridge_dev = NULL. In turn, having a non-NULL dp->bridge_dev when there is no offloading support makes the following things go wrong: - dsa_default_offload_fwd_mark make the wrong decision in setting skb->offload_fwd_mark. It should set skb->offload_fwd_mark = 0 for ports that don't offload the bridge, which should instruct the bridge to forward in software. But this does not happen, dp->bridge_dev is incorrectly set to point to the bridge, so the bridge is told that packets have been forwarded in hardware, which they haven't. - switchdev objects (MDBs, VLANs) should not be offloaded by ports that don't offload the bridge. Standalone ports should behave as packet-in, packet-out and the bridge should not be able to manipulate the pvid of the port, or tag stripping on egress, or ingress filtering. This should already work fine because dsa_slave_port_obj_add has: case SWITCHDEV_OBJ_ID_PORT_VLAN: if (!dsa_port_offloads_bridge_port(dp, obj->orig_dev)) return -EOPNOTSUPP; err = dsa_slave_vlan_add(dev, obj, extack); but since dsa_port_offloads_bridge_port works based on dp->bridge_dev, this is again sabotaging us. All the above work in case the port has an unoffloaded LAG interface, so this is well exercised code, we should apply it for plain unoffloaded bridge ports too. Reported-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
09dba21b43 |
net: dsa: don't call switchdev_bridge_port_unoffload for unoffloaded bridge ports
For ports that have a NULL dp->bridge_dev, dsa_port_to_bridge_port()
also returns NULL as expected.
Issue #1 is that we are performing a NULL pointer dereference on brport_dev.
Issue #2 is that these are ports on which switchdev_bridge_port_offload
has not been called, so we should not call switchdev_bridge_port_unoffload
on them either.
Both issues are addressed by checking against a NULL brport_dev in
dsa_port_pre_bridge_leave and exiting early.
Fixes:
|
||
Vladimir Oltean
|
f5e165e72b |
net: dsa: track unique bridge numbers across all DSA switch trees
Right now, cross-tree bridging setups work somewhat by mistake. In the case of cross-tree bridging with sja1105, all switch instances need to agree upon a common VLAN ID for forwarding a packet that belongs to a certain bridging domain. With TX forwarding offload, the VLAN ID is the bridge VLAN for VLAN-aware bridging, and the tag_8021q TX forwarding offload VID (a VLAN which has non-zero VBID bits) for VLAN-unaware bridging. The VBID for VLAN-unaware bridging is derived from the dp->bridge_num value calculated by DSA independently for each switch tree. If ports from one tree join one bridge, and ports from another tree join another bridge, DSA will assign them the same bridge_num, even though the bridges are different. If cross-tree bridging is supported, this is an issue. Modify DSA to calculate the bridge_num globally across all switch trees. This has the implication for a driver that the dp->bridge_num value that DSA will assign to its ports might not be contiguous, if there are boards with multiple DSA drivers instantiated. Additionally, all bridge_num values eat up towards each switch's ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate, and can be seen as a limitation introduced by this patch. However, that is the lesser evil for now. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
994d2cbb08 |
net: dsa: tag_sja1105: be dsa_loop-safe
Add support for tag_sja1105 running on non-sja1105 DSA ports, by making sure that every time we dereference dp->priv, we check the switch's dsa_switch_ops (otherwise we access a struct sja1105_port structure that is in fact something else). This adds an unconditional build-time dependency between sja1105 being built as module => tag_sja1105 must also be built as module. This was there only for PTP before. Some sane defaults must also take place when not running on sja1105 hardware. These are: - sja1105_xmit_tpid: the sja1105 driver uses different VLAN protocols depending on VLAN awareness and switch revision (when an encapsulated VLAN must be sent). Default to 0x8100. - sja1105_rcv_meta_state_machine: this aggregates PTP frames with their metadata timestamp frames. When running on non-sja1105 hardware, don't do that and accept all frames unmodified. - sja1105_defer_xmit: calls sja1105_port_deferred_xmit in sja1105_main.c which writes a management route over SPI. When not running on sja1105 hardware, bypass the SPI write and send the frame as-is. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
b2b8913341 |
net: dsa: tag_8021q: fix notifiers broadcast when they shouldn't, and vice versa
During the development of the blamed patch, the "bool broadcast"
argument of dsa_port_tag_8021q_vlan_{add,del} was originally called
"bool local", and the meaning was the exact opposite.
Due to a rookie mistake where the patch was modified at the last minute
without retesting, the instances of dsa_port_tag_8021q_vlan_{add,del}
are called with the wrong values. During setup and teardown, cross-chip
notifiers should not be broadcast to all DSA trees, while during
bridging, they should.
Fixes:
|
||
Jakub Kicinski
|
f4083a752a |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h |
||
Vladimir Oltean
|
724395f4dc |
net: dsa: tag_8021q: don't broadcast during setup/teardown
Currently, on my board with multiple sja1105 switches in disjoint trees
described in commit
|
||
Vladimir Oltean
|
ab97462beb |
net: dsa: print more information when a cross-chip notifier fails
Currently this error message does not say a lot: [ 32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT [ 32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT [ 32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT [ 32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT [ 32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT [ 32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT but in this form, it is immediately obvious (at least to me) what the problem is, even without further looking at the code: [ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT [ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT [ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT [ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT [ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT [ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
a72808b658 |
net: dsa: create a helper for locating EtherType DSA headers on TX
Create a similar helper for locating the offset to the DSA header relative to skb->data, and make the existing EtherType header taggers to use it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
5d928ff486 |
net: dsa: create a helper for locating EtherType DSA headers on RX
It seems that protocol tagging driver writers are always surprised about the formula they use to reach their EtherType header on RX, which becomes apparent from the fact that there are comments in multiple drivers that mention the same information. Create a helper that returns a void pointer to skb->data - 2, as well as centralize the explanation why that is the case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
6bef794da6 |
net: dsa: create a helper which allocates space for EtherType DSA headers
Hide away the memmove used by DSA EtherType header taggers to shift the MAC SA and DA to the left to make room for the header, after they've called skb_push(). The call to skb_push() is still left explicit in drivers, to be symmetric with dsa_strip_etype_header, and because not all callers can be refactored to do it (for example, brcm_tag_xmit_ll has common code for a pre-Ethernet DSA tag and an EtherType DSA tag). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
f1dacd7aea |
net: dsa: create a helper that strips EtherType DSA headers on RX
All header taggers open-code a memmove that is fairly not all that obvious, and we can hide the details behind a helper function, since the only thing specific to the driver is the length of the header tag. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
c35b57ceff |
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
The blamed commit added a new field to struct switchdev_notifier_fdb_info,
but did not make sure that all call paths set it to something valid.
For example, a switchdev driver may emit a SWITCHDEV_FDB_ADD_TO_BRIDGE
notifier, and since the 'is_local' flag is not set, it contains junk
from the stack, so the bridge might interpret those notifications as
being for local FDB entries when that was not intended.
To avoid that now and in the future, zero-initialize all
switchdev_notifier_fdb_info structures created by drivers such that all
newly added fields to not need to touch drivers again.
Fixes:
|
||
Leon Romanovsky
|
919d13a7e4 |
devlink: Set device as early as possible
All kernel devlink implementations call to devlink_alloc() during initialization routine for specific device which is used later as a parent device for devlink_register(). Such late device assignment causes to the situation which requires us to call to device_register() before setting other parameters, but that call opens devlink to the world and makes accessible for the netlink users. Any attempt to move devlink_register() to be the last call generates the following error due to access to the devlink->dev pointer. [ 8.758862] devlink_nl_param_fill+0x2e8/0xe50 [ 8.760305] devlink_param_notify+0x6d/0x180 [ 8.760435] __devlink_params_register+0x2f1/0x670 [ 8.760558] devlink_params_register+0x1e/0x20 The simple change of API to set devlink device in the devlink_alloc() instead of devlink_register() fixes all this above and ensures that prior to call to devlink_register() everything already set. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
bee7c577e6 |
net: dsa: avoid fast ageing twice when port leaves a bridge
Drivers that support both the toggling of address learning and dynamic FDB flushing (mv88e6xxx, b53, sja1105) currently need to fast-age a port twice when it leaves a bridge: - once, when del_nbp() calls br_stp_disable_port() which puts the port in the BLOCKING state - twice, when dsa_port_switchdev_unsync_attrs() calls dsa_port_clear_brport_flags() which disables address learning The knee-jerk reaction might be to say "dsa_port_clear_brport_flags does not need to fast-age the port at all", but the thing is, we still need both code paths to flush the dynamic FDB entries in different situations. When a DSA switch port leaves a bonding/team interface that is (still) a bridge port, no del_nbp() will be called, so we rely on dsa_port_clear_brport_flags() function to restore proper standalone port functionality with address learning disabled. So the solution is just to avoid double the work when both code paths are called in series. Luckily, DSA already caches the STP port state, so we can skip flushing the dynamic FDB when we disable address learning and the STP state is one where no address learning takes place at all. Under that condition, not flushing the FDB is safe because there is supposed to not be any dynamic FDB entry at all (they were flushed during the transition towards that state, and none were learned in the meanwhile). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
a4ffe09fc2 |
net: dsa: still fast-age ports joining a bridge if they can't configure learning
Commit
|
||
Vladimir Oltean
|
9264e4ad26 |
net: dsa: flush the dynamic FDB of the software bridge when fast ageing a port
Currently, when DSA performs fast ageing on a port, 'bridge fdb' shows us that the 'self' entries (corresponding to the hardware bridge, as printed by dsa_slave_fdb_dump) are deleted, but the 'master' entries (corresponding to the software bridge) aren't. Indeed, searching through the bridge driver, neither the brport_attr_learning handler nor the IFLA_BRPORT_LEARNING handler call br_fdb_delete_by_port. However, br_stp_disable_port does, which is one of the paths which DSA uses to trigger a fast ageing process anyway. There is, however, one other very promising caller of br_fdb_delete_by_port, and that is the bridge driver's handler of the SWITCHDEV_FDB_FLUSH_TO_BRIDGE atomic notifier. Currently the s390/qeth HiperSockets card driver is the only user of this. I can't say I understand that driver's architecture or interaction with the bridge, but it appears to not be a switchdev driver in the traditional sense of the word. Nonetheless, the mechanism it provides is a useful way for DSA to express the fact that it performs fast ageing too, in a way that does not change the existing behavior for other drivers. Cc: Alexandra Winter <wintera@linux.ibm.com> Cc: Julian Wiedmann <jwi@linux.ibm.com> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
4eab90d973 |
net: dsa: don't fast age bridge ports with learning turned off
On topology changes, stations that were dynamically learned on ports that are no longer part of the active topology must be flushed - this is described by clause "17.11 Updating learned station location information" of IEEE 802.1D-2004. However, when address learning on the bridge port is turned off in the first place, there is nothing to flush, so skip a potentially expensive operation. We can finally do this now since DSA is aware of the learning state of its bridged ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
045c45d1f5 |
net: dsa: centralize fast ageing when address learning is turned off
Currently DSA leaves it down to device drivers to fast age the FDB on a port when address learning is disabled on it. There are 2 reasons for doing that in the first place: - when address learning is disabled by user space, through IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user space typically wants to achieve is to operate in a mode with no dynamic FDB entry on that port. But if the port is already up, some addresses might have been already learned on it, and it seems silly to wait for 5 minutes for them to expire until something useful can be done. - when a port leaves a bridge and becomes standalone, DSA turns off address learning on it. This also has the nice side effect of flushing the dynamically learned bridge FDB entries on it, which is a good idea because standalone ports should not have bridge FDB entries on them. We let drivers manage fast ageing under this condition because if DSA were to do it, it would need to track each port's learning state, and act upon the transition, which it currently doesn't. But there are 2 reasons why doing it is better after all: - drivers might get it wrong and not do it (see b53_port_set_learning) - we would like to flush the dynamic entries from the software bridge too, and letting drivers do that would be another pain point So track the port learning state and trigger a fast age process automatically within DSA. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
39f3210154 |
net: dsa: don't fast age standalone ports
DSA drives the procedure to flush dynamic FDB entries from a port based
on the change of STP state: whenever we go from a state where address
learning is enabled (LEARNING, FORWARDING) to a state where it isn't
(LISTENING, BLOCKING, DISABLED), we need to flush the existing dynamic
entries.
However, there are cases when this is not needed. Internally, when a
DSA switch interface is not under a bridge, DSA still keeps it in the
"FORWARDING" STP state. And when that interface joins a bridge, the
bridge will meticulously iterate that port through all STP states,
starting with BLOCKING and ending with FORWARDING. Because there is a
state transition from the standalone version of FORWARDING into the
temporary BLOCKING bridge port state, DSA calls the fast age procedure.
Since commit
|
||
Vladimir Oltean
|
c73c57081b |
net: dsa: don't disable multicast flooding to the CPU even without an IGMP querier
Commit |
||
Vladimir Oltean
|
7df4e74494 |
net: dsa: stop syncing the bridge mcast_router attribute at join time
Qingfang points out that when a bridge with the default settings is
created and a port joins it:
ip link add br0 type bridge
ip link set swp0 master br0
DSA calls br_multicast_router() on the bridge to see if the br0 device
is a multicast router port, and if it is, it enables multicast flooding
to the CPU port, otherwise it disables it.
If we look through the multicast_router_show() sysfs or at the
IFLA_BR_MCAST_ROUTER netlink attribute, we see that the default mrouter
attribute for the bridge device is "1" (MDB_RTR_TYPE_TEMP_QUERY).
However, br_multicast_router() will return "0" (MDB_RTR_TYPE_DISABLED),
because an mrouter port in the MDB_RTR_TYPE_TEMP_QUERY state may not be
actually _active_ until it receives an actual IGMP query. So, the
br_multicast_router() function should really have been called
br_multicast_router_active() perhaps.
When/if an IGMP query is received, the bridge device will transition via
br_multicast_mark_router() into the active state until the
ip4_mc_router_timer expires after an multicast_querier_interval.
Of course, this does not happen if the bridge is created with an
mcast_router attribute of "2" (MDB_RTR_TYPE_PERM).
The point is that in lack of any IGMP query messages, and in the default
bridge configuration, unregistered multicast packets will not be able to
reach the CPU port through flooding, and this breaks many use cases
(most obviously, IPv6 ND, with its ICMP6 neighbor solicitation multicast
messages).
Leave the multicast flooding setting towards the CPU port down to a driver
level decision.
Fixes:
|
||
Vladimir Oltean
|
f8b17a0bd9 |
net: dsa: tag_sja1105: optionally build as module when switch driver is module if PTP is enabled
TX timestamps are sent by SJA1110 as Ethernet packets containing
metadata, so they are received by the tagging driver but must be
processed by the switch driver - the one that is stateful since it
keeps the TX timestamp queue.
This means that there is an sja1110_process_meta_tstamp() symbol
exported by the switch driver which is called by the tagging driver.
There is a shim definition for that function when the switch driver is
not compiled, which does nothing, but that shim is not effective when
the tagging protocol driver is built-in and the switch driver is a
module, because built-in code cannot call symbols exported by modules.
So add an optional dependency between the tagger and the switch driver,
if PTP support is enabled in the switch driver. If PTP is not enabled,
sja1110_process_meta_tstamp() will translate into the shim "do nothing
with these meta frames" function.
Fixes:
|
||
Vladimir Oltean
|
2c0b03258b |
net: dsa: give preference to local CPU ports
Be there an "H" switch topology, where there are 2 switches connected as follows: eth0 eth1 | | CPU port CPU port | DSA link | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0 | | | | | | user user user user user user port port port port port port basically one where each switch has its own CPU port for termination, but there is also a DSA link in case packets need to be forwarded in hardware between one switch and another. DSA insists to see this as a daisy chain topology, basically registering all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding eth1 as a valid DSA master. This is only half the story, since when asked using dsa_port_is_cpu(), DSA will respond that sw1p1 is a CPU port, however one which has no dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used. Furthermore, be there a driver for switches which support only one upstream port. This driver iterates through its ports and checks using dsa_is_upstream_port() whether the current port is an upstream one. For switch 1, two ports pass the "is upstream port" checks: - sw1p4 is an upstream port because it is a routing port towards the dedicated CPU port assigned using dsa_tree_setup_default_cpu() - sw1p1 is also an upstream port because it is a CPU port, albeit one that is disabled. This is because dsa_upstream_port() returns: if (!cpu_dp) return port; which means that if @dp does not have a ->cpu_dp pointer (which is a characteristic of CPU ports themselves as well as unused ports), then @dp is its own upstream port. So the driver for switch 1 rightfully says: I have two upstream ports, but I don't support multiple upstream ports! So let me error out, I don't know which one to choose and what to do with the other one. Generally I am against enforcing any default policy in the kernel in terms of user to CPU port assignment (like round robin or such) but this case is different. To solve the conundrum, one would have to: - Disable sw1p1 in the device tree or mark it as "not a CPU port" in order to comply with DSA's view of this topology as a daisy chain, where the termination traffic from switch 1 must pass through switch 0. This is counter-productive because it wastes 1Gbps of termination throughput in switch 1. - Disable the DSA link between sw0p4 and sw1p4 and do software forwarding between switch 0 and 1, and basically treat the switches as part of disjoint switch trees. This is counter-productive because it wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1. - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could work, but it makes cross-chip bridging impossible. In this setup we would need to have 2 separate bridges, br0 spanning the ports of switch 0, and br1 spanning the ports of switch 1, and the "DSA links treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are the gateway ports between one bridge and another. This is hard to manage from a user's perspective, who wants to have a unified view of the switching fabric and the ability to transparently add ports to the same bridge. VLANs would also need to be explicitly managed by the user on these gateway ports. So it seems that the only reasonable thing to do is to make DSA prefer CPU ports that are local to the switch. Meaning that by default, the user and DSA ports of switch 0 will get assigned to the CPU port from switch 0 (sw0p1) and the user and DSA ports of switch 1 will get assigned to the CPU port from switch 1. The way this solves the problem is that sw1p4 is no longer an upstream port as far as switch 1 is concerned (it no longer views sw0p1 as its dedicated CPU port). So here we are, the first multi-CPU port that DSA supports is also perhaps the most uneventful one: the individual switches don't support multiple CPUs, however the DSA switch tree as a whole does have multiple CPU ports. No user space assignment of user ports to CPU ports is desirable, necessary, or possible. Ports that do not have a local CPU port (say there was an extra switch hanging off of sw0p0) default to the standard implementation of getting assigned to the first CPU port of the DSA switch tree. Is that good enough? Probably not (if the downstream switch was hanging off of switch 1, we would most certainly prefer its CPU port to be sw1p1), but in order to support that use case too, we would need to traverse the dst->rtable in search of an optimum dedicated CPU port, one that has the smallest number of hops between dp->ds and dp->cpu_dp->ds. At the moment, the DSA routing table structure does not keep the number of hops between dl->dp and dl->link_dp, and while it is probably deducible, there is zero justification to write that code now. Let's hope DSA will never have to support that use case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
0e8eb9a16e |
net: dsa: rename teardown_default_cpu to teardown_cpu_ports
There is nothing specific to having a default CPU port to what dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports, it would do the same thing: iterate through the ports of this switch tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
421297efe6 |
net: dsa: tag_sja1105: consistently fail with arbitrary input
Dan Carpenter's smatch tests report that the "vid" variable, populated by sja1105_vlan_rcv when an skb is received by the tagger that has a VLAN ID which cannot be decoded by tag_8021q, may be uninitialized when used here: if (source_port == -1 || switch_id == -1) skb->dev = dsa_find_designated_bridge_port_by_vid(netdev, vid); The sja1105 driver, by construction, sets up the switch in a way that all data plane packets sent towards the CPU port are VLAN-tagged. So it is practically impossible, in a functional system, for a packet to be processed by sja1110_rcv() which is not a control packet and does not have a VLAN header either. However, it would be nice if the sja1105 tagging driver could consistently do something valid, for example fail, even if presented with packets that do not hold valid sja1105 tags. Currently it is a bit hard to argue that it does that, given the fact that a data plane packet with no VLAN tag will trigger a call to dsa_find_designated_bridge_port_by_vid with a vid argument that is an uninitialized stack variable. To fix this, we can initialize the u16 vid variable with 0, a value that can never be a bridge VLAN, so dsa_find_designated_bridge_port_by_vid will always return a NULL skb->dev. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210802195137.303625-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Vladimir Oltean
|
29a097b774 |
net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv()
No tagging driver uses this. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
bea7907837 |
net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge
DSA has gained the recent ability to deal gracefully with upper interfaces it cannot offload, such as the bridge, bonding or team drivers. When such uppers exist, the ports are still in standalone mode as far as the hardware is concerned. But when we deliver packets to the software bridge in order for that to do the forwarding, there is an unpleasant surprise in that the bridge will refuse to forward them. This is because we unconditionally set skb->offload_fwd_mark = true, meaning that the bridge thinks the frames were already forwarded in hardware by us. Since dp->bridge_dev is populated only when there is hardware offload for it, but not in the software fallback case, let's introduce a new helper that can be called from the tagger data path which sets the skb->offload_fwd_mark accordingly to zero when there is no hardware offload for bridging. This lets the bridge forward packets back to other interfaces of our switch, if needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
04a1758348 |
net: dsa: tag_sja1105: fix control packets on SJA1110 being received on an imprecise port
On RX, a control packet with SJA1110 will have:
- an in-band control extension (DSA tag) composed of a header and an
optional trailer (if it is a timestamp frame). We can (and do) deduce
the source port and switch id from this.
- a VLAN header, which can either be the tag_8021q RX VLAN (pvid) or the
bridge VLAN. The sja1105_vlan_rcv() function attempts to deduce the
source port and switch id a second time from this.
The basic idea is that even though we don't need the source port
information from the tag_8021q header if it's a control packet, we do
need to strip that header before we pass it on to the network stack.
The problem is that we call sja1105_vlan_rcv for ports under VLAN-aware
bridges, and that function tells us it couldn't identify a tag_8021q
header, so we need to perform imprecise RX by VID. Well, we don't,
because we already know the source port and switch ID.
This patch drops the return value from sja1105_vlan_rcv and we just look
at the source_port and switch_id values from sja1105_rcv and sja1110_rcv
which were initialized to -1. If they are still -1 it means we need to
perform imprecise RX.
Fixes:
|
||
Arnd Bergmann
|
a76053707d |
dev_ioctl: split out ndo_eth_ioctl
Most users of ndo_do_ioctl are ethernet drivers that implement the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP. Separate these from the few drivers that use ndo_do_ioctl to implement SIOCBOND, SIOCBR and SIOCWANDEV commands. This is a purely cosmetic change intended to help readers find their way through the implementation. Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Vladimir Oltean <olteanv@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
edac6f6332 |
Revert "net: dsa: Allow drivers to filter packets they can decode source port from"
This reverts commit
|
||
Vladimir Oltean
|
b6ad86e6ad |
net: dsa: sja1105: add bridge TX data plane offload based on tag_8021q
The main desire for having this feature in sja1105 is to support network stack termination for traffic coming from a VLAN-aware bridge. For sja1105, offloading the bridge data plane means sending packets as-is, with the proper VLAN tag, to the chip. The chip will look up its FDB and forward them to the correct destination port. But we support bridge data plane offload even for VLAN-unaware bridges, and the implementation there is different. In fact, VLAN-unaware bridging is governed by tag_8021q, so it makes sense to have the .bridge_fwd_offload_add() implementation fully within tag_8021q. The key difference is that we only support 1 VLAN-aware bridge, but we support multiple VLAN-unaware bridges. So we need to make sure that the forwarding domain is not crossed by packets injected from the stack. For this, we introduce the concept of a tag_8021q TX VLAN for bridge forwarding offload. As opposed to the regular TX VLANs which contain only 2 ports (the user port and the CPU port), a bridge data plane TX VLAN is "multicast" (or "imprecise"): it contains all the ports that are part of a certain bridge, and the hardware will select where the packet goes within this "imprecise" forwarding domain. Each VLAN-unaware bridge has its own "imprecise" TX VLAN, so we make use of the unique "bridge_num" provided by DSA for the data plane offload. We use the same 3 bits from the tag_8021q VLAN ID format to encode this bridge number. Note that these 3 bit positions have been used before for sub-VLANs in best-effort VLAN filtering mode. The difference is that for best-effort, the sub-VLANs were only valid on RX (and it was documented that the sub-VLAN field needed to be transmitted as zero). Whereas for the bridge data plane offload, these 3 bits are only valid on TX. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
884be12f85 |
net: dsa: sja1105: add support for imprecise RX
This is already common knowledge by now, but the sja1105 does not have hardware support for DSA tagging for data plane packets, and tag_8021q sets up a unique pvid per port, transmitted as VLAN-tagged towards the CPU, for the source port to be decoded nonetheless. When the port is part of a VLAN-aware bridge, the pvid committed to hardware is taken from the bridge and not from tag_8021q, so we need to work with that the best we can. Configure the switches to send all packets to the CPU as VLAN-tagged (even ones that were originally untagged on the wire) and make use of dsa_untag_bridge_pvid() to get rid of it before we send those packets up the network stack. With the classified VLAN used by hardware known to the tagger, we first peek at the VID in an attempt to figure out if the packet was received from a VLAN-unaware port (standalone or under a VLAN-unaware bridge), case in which we can continue to call dsa_8021q_rcv(). If that is not the case, the packet probably came from a VLAN-aware bridge. So we call the DSA helper that finds for us a "designated bridge port" - one that is a member of the VLAN ID from the packet, and is in the proper STP state - basically these are all checks performed by br_handle_frame() in the software RX data path. The bridge will accept the packet as valid even if the source port was maybe wrong. So it will maybe learn the MAC SA of the packet on the wrong port, and its software FDB will be out of sync with the hardware FDB. So replies towards this same MAC DA will not work, because the bridge will send towards a different netdev. This is where the bridge data plane offload ("imprecise TX") added by the next patch comes in handy. The software FDB is wrong, true, but the hardware FDB isn't, and by offloading the bridge forwarding plane we have a chance to right a wrong, and have the hardware look up the FDB for us for the reply packet. So it all cancels out. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Tobias Waldekranz
|
d82f8ab0d8 |
net: dsa: tag_dsa: offload the bridge forwarding process
Allow the DSA tagger to generate FORWARD frames for offloaded skbs sent from a bridge that we offload, allowing the switch to handle any frame replication that may be required. This also means that source address learning takes place on packets sent from the CPU, meaning that return traffic no longer needs to be flooded as unknown unicast. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
123abc06e7 |
net: dsa: add support for bridge TX forwarding offload
For a DSA switch, to offload the forwarding process of a bridge device means to send the packets coming from the software bridge as data plane packets. This is contrary to everything that DSA has done so far, because the current taggers only know to send control packets (ones that target a specific destination port), whereas data plane packets are supposed to be forwarded according to the FDB lookup, much like packets ingressing on any regular ingress port. If the FDB lookup process returns multiple destination ports (flooding, multicast), then replication is also handled by the switch hardware - the bridge only sends a single packet and avoids the skb_clone(). DSA keeps for each bridge port a zero-based index (the number of the bridge). Multiple ports performing TX forwarding offload to the same bridge have the same dp->bridge_num value, and ports not offloading the TX data plane of a bridge have dp->bridge_num = -1. The tagger can check if the packet that is being transmitted on has skb->offload_fwd_mark = true or not. If it does, it can be sure that the packet belongs to the data plane of a bridge, further information about which can be obtained based on dp->bridge_dev and dp->bridge_num. It can then compose a DSA tag for injecting a data plane packet into that bridge number. For the switch driver side, we offer two new dsa_switch_ops methods, called .port_bridge_fwd_offload_{add,del}, which are modeled after .port_bridge_{join,leave}. These methods are provided in case the driver needs to configure the hardware to treat packets coming from that bridge software interface as data plane packets. The switchdev <-> bridge interaction happens during the netdev_master_upper_dev_link() call, so to switch drivers, the effect is that the .port_bridge_fwd_offload_add() method is called immediately after .port_bridge_join(). If the bridge number exceeds the number of bridges for which the switch driver can offload the TX data plane (and this includes the case where the driver can offload none), DSA falls back to simply returning tx_fwd_offload = false in the switchdev_bridge_port_offload() call. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
5b22d3669f |
net: dsa: track the number of switches in a tree
In preparation of supporting data plane forwarding on behalf of a software bridge, some drivers might need to view bridges as virtual switches behind the CPU port in a cross-chip topology. Give them some help and let them know how many physical switches there are in the tree, so that they can count the virtual switches starting from that number on. Note that the first dsa_switch_ops method where this information is reliably available is .setup(). This is because of how DSA works: in a tree with 3 switches, each calling dsa_register_switch(), the first 2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table() and exit with error code 0 because the topology is not complete. Since probing is parallel at this point, one switch does not know about the existence of the other. Then the third switch comes, and for it, dsa_tree_setup_routing_table() returns complete = true. This switch goes ahead and calls dsa_tree_setup_switches() for everybody else, calling their .setup() methods too. This acts as the synchronization point. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Tobias Waldekranz
|
472111920f |
net: bridge: switchdev: allow the TX data plane forwarding to be offloaded
Allow switchdevs to forward frames from the CPU in accordance with the bridge configuration in the same way as is done between bridge ports. This means that the bridge will only send a single skb towards one of the ports under the switchdev's control, and expects the driver to deliver the packet to all eligible ports in its domain. Primarily this improves the performance of multicast flows with multiple subscribers, as it allows the hardware to perform the frame replication. The basic flow between the driver and the bridge is as follows: - When joining a bridge port, the switchdev driver calls switchdev_bridge_port_offload() with tx_fwd_offload = true. - The bridge sends offloadable skbs to one of the ports under the switchdev's control using skb->offload_fwd_mark = true. - The switchdev driver checks the skb->offload_fwd_mark field and lets its FDB lookup select the destination port mask for this packet. v1->v2: - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long - introduce a static key "br_switchdev_fwd_offload_used" to minimize the impact of the newly introduced feature on all the setups which don't have hardware that can make use of it - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache line access - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in __br_forward() - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge is being used - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP v2->v3: - replace the solution based on .ndo_dfwd_add_station with a solution based on switchdev_bridge_port_offload - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD v3->v4: rebase v4->v5: - make sure the static key is decremented on bridge port unoffload - more function and variable renaming and comments for them: br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload fwd_accel to tx_fwd_offload Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
David S. Miller
|
5af84df962 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
4e51bf44a0 |
net: bridge: move the switchdev object replay helpers to "push" mode
Starting with commit |
||
Vladimir Oltean
|
2f5dc00f7a |
net: bridge: switchdev: let drivers inform which bridge ports are offloaded
On reception of an skb, the bridge checks if it was marked as 'already forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it is, it assigns the source hardware domain of that skb based on the hardware domain of the ingress port. Then during forwarding, it enforces that the egress port must have a different hardware domain than the ingress one (this is done in nbp_switchdev_allowed_egress). Non-switchdev drivers don't report any physical switch id (neither through devlink nor .ndo_get_port_parent_id), therefore the bridge assigns them a hardware domain of 0, and packets coming from them will always have skb->offload_fwd_mark = 0. So there aren't any restrictions. Problems appear due to the fact that DSA would like to perform software fallback for bonding and team interfaces that the physical switch cannot offload. +-- br0 ---+ / / | \ / / | \ / | | bond0 / | | / \ swp0 swp1 swp2 swp3 swp4 There, it is desirable that the presence of swp3 and swp4 under a non-offloaded LAG does not preclude us from doing hardware bridging beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high enough that software bridging between {swp0,swp1,swp2} and bond0 is not impractical. But this creates an impossible paradox given the current way in which port hardware domains are assigned. When the driver receives a packet from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to something. - If we set it to 0, then the bridge will forward it towards swp1, swp2 and bond0. But the switch has already forwarded it towards swp1 and swp2 (not to bond0, remember, that isn't offloaded, so as far as the switch is concerned, ports swp3 and swp4 are not looking up the FDB, and the entire bond0 is a destination that is strictly behind the CPU). But we don't want duplicated traffic towards swp1 and swp2, so it's not ok to set skb->offload_fwd_mark = 0. - If we set it to 1, then the bridge will not forward the skb towards the ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should have forwarded the skb there. So the real issue is that bond0 will be assigned the same hardware domain as {swp0,swp1,swp2}, because the function that assigns hardware domains to bridge ports, nbp_switchdev_add(), recurses through bond0's lower interfaces until it finds something that implements devlink (calls dev_get_port_parent_id with bool recurse = true). This is a problem because the fact that bond0 can be offloaded by swp3 and swp4 in our example is merely an assumption. A solution is to give the bridge explicit hints as to what hardware domain it should use for each port. Currently, the bridging offload is very 'silent': a driver registers a netdevice notifier, which is put on the netns's notifier chain, and which sniffs around for NETDEV_CHANGEUPPER events where the upper is a bridge, and the lower is an interface it knows about (one registered by this driver, normally). Then, from within that notifier, it does a bunch of stuff behind the bridge's back, without the bridge necessarily knowing that there's somebody offloading that port. It looks like this: ip link set swp0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v call_netdevice_notifiers | v dsa_slave_netdevice_event | v oh, hey! it's for me! | v .port_bridge_join What we do to solve the conundrum is to be less silent, and change the switchdev drivers to present themselves to the bridge. Something like this: ip link set swp0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the | | hardware domain for v | this port, and zero dsa_slave_netdevice_event | if I got nothing. | | v | oh, hey! it's for me! | | | v | .port_bridge_join | | | +------------------------+ switchdev_bridge_port_offload(swp0, swp0) Then stacked interfaces (like bond0 on top of swp3/swp4) would be treated differently in DSA, depending on whether we can or cannot offload them. The offload case: ip link set bond0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the | | switchdev mark for v | bond0. dsa_slave_netdevice_event | Coincidentally (or not), | | bond0 and swp0, swp1, swp2 v | all have the same switchdev hmm, it's not quite for me, | mark now, since the ASIC but my driver has already | is able to forward towards called .port_lag_join | all these ports in hw. for it, because I have | a port with dp->lag_dev == bond0. | | | v | .port_bridge_join | for swp3 and swp4 | | | +------------------------+ switchdev_bridge_port_offload(bond0, swp3) switchdev_bridge_port_offload(bond0, swp4) And the non-offload case: ip link set bond0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge waiting: call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload | | wasn't called, okay, I'll use a v | hwdom of zero for this one. dsa_slave_netdevice_event : Then packets received on swp0 will | : not be software-forwarded towards v : swp1, but they will towards bond0. it's not for me, but bond0 is an upper of swp3 and swp4, but their dp->lag_dev is NULL because they couldn't offload it. Basically we can draw the conclusion that the lowers of a bridge port can come and go, so depending on the configuration of lowers for a bridge port, it can dynamically toggle between offloaded and unoffloaded. Therefore, we need an equivalent switchdev_bridge_port_unoffload too. This patch changes the way any switchdev driver interacts with the bridge. From now on, everybody needs to call switchdev_bridge_port_offload and switchdev_bridge_port_unoffload, otherwise the bridge will treat the port as non-offloaded and allow software flooding to other ports from the same ASIC. Note that these functions lay the ground for a more complex handshake between switchdev drivers and the bridge in the future. For drivers that will request a replay of the switchdev objects when they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we place the call to switchdev_bridge_port_unoffload() strategically inside the NETDEV_PRECHANGEUPPER notifier's code path, and not inside NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers need the netdev adjacency lists to be valid, and that is only true in NETDEV_PRECHANGEUPPER. Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Lino Sanfilippo
|
37120f23ac |
net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
If the checksum calculation is offloaded to the network device (e.g due to NETIF_F_HW_CSUM inherited from the DSA master device), the calculated layer 4 checksum is incorrect. This is since the DSA tag which is placed after the layer 4 data is considered as being part of the daa and thus errorneously included into the checksum calculation. To avoid this, always calculate the layer 4 checksum in software. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Lino Sanfilippo
|
21cf377a9c |
net: dsa: ensure linearized SKBs in case of tail taggers
The function skb_put() that is used by tail taggers to make room for the DSA tag must only be called for linearized SKBS. However in case that the slave device inherited features like NETIF_F_HW_SG or NETIF_F_FRAGLIST the SKB passed to the slaves transmit function may not be linearized. Avoid those SKBs by clearing the NETIF_F_HW_SG and NETIF_F_FRAGLIST flags for tail taggers. Furthermore since the tagging protocol can be changed at runtime move the code for setting up the slaves features into dsa_slave_setup_tagger(). Suggested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
b94dc99c0d |
net: dsa: use switchdev_handle_fdb_{add,del}_to_device
Using the new fan-out helper for FDB entries installed on the software bridge, we can install host addresses with the proper refcount on the CPU port, such that this case: ip link set swp0 master br0 ip link set swp1 master br0 ip link set swp2 master br0 ip link set swp3 master br0 ip link set br0 address 00:01:02:03:04:05 ip link set swp3 nomaster works properly and the br0 address remains installed as a host entry with refcount 3 instead of getting deleted. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
c6451cda10 |
net: switchdev: introduce helper for checking dynamically learned FDB entries
It is a bit difficult to understand what DSA checks when it tries to avoid installing dynamically learned addresses on foreign interfaces as local host addresses, so create a generic switchdev helper that can be reused and is generally more readable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
c64b9c0504 |
net: dsa: tag_8021q: add proper cross-chip notifier support
The big problem which mandates cross-chip notifiers for tag_8021q is this: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] When the user runs: ip link add br0 type bridge ip link set sw0p0 master br0 ip link set sw2p0 master br0 It doesn't work. This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and "other_ds" are at most 1 hop away from each other, so it is sufficient to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice versa and presto, the cross-chip link works. When there is another switch in the middle, such as in this case switch 1 with its DSA links sw1p3 and sw1p4, somebody needs to tell it about these VLANs too. Which is exactly why the problem is quadratic: when a port joins a bridge, for each port in the tree that's already in that same bridge we notify a tag_8021q VLAN addition of that port's RX VLAN to the entire tree. It is a very complicated web of VLANs. It must be mentioned that currently we install tag_8021q VLANs on too many ports (DSA links - to be precise, on all of them). For example, when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there isn't any port of switch 0 that is a member of br0 (at least yet). In theory we could notify only the switches which sit in between the port joining the bridge and the port reacting to that bridge_join event. But in practice that is impossible, because of the way 'link' properties are described in the device tree. The DSA bindings require DT writers to list out not only the real/physical DSA links, but in fact the entire routing table, like for example switch 0 above will have: sw0p3: port@3 { link = <&sw1p4 &sw2p4>; }; This was done because: /* TODO: ideally DSA ports would have a single dp->link_dp member, * and no dst->rtable nor this struct dsa_link would be needed, * but this would require some more complex tree walking, * so keep it stupid at the moment and list them all. */ but it is a perfect example of a situation where too much information is actively detrimential, because we are now in the position where we cannot distinguish a real DSA link from one that is put there to avoid the 'complex tree walking'. And because DT is ABI, there is not much we can change. And because we do not know which DSA links are real and which ones aren't, we can't really know if DSA switch A is in the data path between switches B and C, in the general case. So this is why tag_8021q RX VLANs are added on all DSA links, and probably why it will never change. On the other hand, at least the number of additions/deletions is well balanced, and this means that once we implement reference counting at the cross-chip notifier level a la fdb/mdb, there is absolutely zero need for a struct dsa_8021q_crosschip_link, it's all self-managing. In fact, with the tag_8021q notifiers emitted from the bridge join notifiers, it becomes so generic that sja1105 does not need to do anything anymore, we can just delete its implementation of the .crosschip_bridge_{join,leave} methods. Among other things we can simply delete is the home-grown implementation of sja1105_notify_crosschip_switches(). The reason why that is wrong is because it is not quadratic - it only covers remote switches to which we have a cross-chip bridging link and that does not cover in-between switches. This deletion is part of the same patch because sja1105 used to poke deep inside the guts of the tag_8021q context in order to do that. Because the cross-chip links went away, so needs the sja1105 code. Last but not least, dsa_8021q_setup_port() is simplified (and also renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react on the CPU port too, the four dsa_8021q_vid_apply() calls: - 1 for RX VLAN on user port - 1 for the user port's RX VLAN on the CPU port - 1 for TX VLAN on user port - 1 for the user port's TX VLAN on the CPU port now get squashed into only 2 notifier calls via dsa_port_tag_8021q_vlan_add. And because the notifiers to add and to delete a tag_8021q VLAN are distinct, now we finally break up the port setup and teardown into separate functions instead of relying on a "bool enabled" flag which tells us what to do. Arguably it should have been this way from the get go. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
e19cc13c9c |
net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time
There has been at least one wasted opportunity for tag_8021q to be used by a driver: https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272 because of a design decision: the declared purpose of tag_8021q is to offer source port/switch identification for a tagging driver for packets coming from a switch with no hardware DSA tagging support. It is not intended to provide VLAN-based port isolation, because its first user, sja1105, had another mechanism for bridging domain isolation, the L2 Forwarding Table. So even if 2 ports are in the same VLAN but they are separated via the L2 Forwarding Table, they will not communicate with one another. The L2 Forwarding Table is managed by the sja1105_bridge_join() and sja1105_bridge_leave() methods. As a consequence, today tag_8021q does not bother too much with hooking into .port_bridge_join() and .port_bridge_leave() because that would introduce yet another degree of freedom, it just iterates statically through all ports of a switch and adds the RX VLAN of one port to all the others. In this way, whenever .port_bridge_join() is called, bridging will magically work because the RX VLANs are already installed everywhere they need to be. This is not to say that the reason for the change in this patch is to satisfy the hellcreek and similar use cases, that is merely a nice side effect. Instead it is to make sja1105 cross-chip links work properly over a DSA link. For context, sja1105 today supports a degenerate form of cross-chip bridging, where the switches are interconnected through their CPU ports ("disjoint trees" topology). There is some code which has been generalized into dsa_8021q_crosschip_link_{add,del}, but it is not enough, and frankly it is impossible to build upon that. Real multi-switch DSA trees, like daisy chains or H trees, which have actual DSA links, do not work. The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT for cross-chip bridging, which is a table by which the local switch can select the forwarding domain for packets from a certain ingress switch ID and source port. The sja1105 switches cannot parse their own DSA tags, because, well, they don't really have support for DSA tags, it's all VLANs. So to make something like cross-chip bridging between sw0p0 and sw1p0 to work over the sw0p3/sw1p3 DSA link to work with sja1105 in the topology below: | | sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0 [ user ] [ user ] [ cpu ] [ dsa ] ---- [ dsa ] [ cpu ] [ user ] [ user ] we need to ask ourselves 2 questions: (1) how should the L2 Forwarding Table be managed? (2) how should the VLAN Lookup Table be managed? i.e. what should prevent packets from going to unwanted ports? Since as mentioned, there is no PVT, the L2 Forwarding Table only contains forwarding rules for local ports. So we can say "all user ports are allowed to forward to all CPU ports and all DSA links". If we allow forwarding to DSA links unconditionally, this means we must prevent forwarding using the VLAN Lookup Table. This is in fact asymmetric with what we do for tag_8021q on ports local to the same switch, and it matters because now that we are making tag_8021q a core DSA feature, we need to hook into .crosschip_bridge_join() to add/remove the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs for local forwarding in the same way as cross-chip forwarding. Note that there is a very precise reason why tag_8021q hooks into dsa_switch_bridge_join() which acts at the cross-chip notifier level, and not at a higher level such as dsa_port_bridge_join(). We need to install the RX VLAN of the newly joining port into the VLAN table of all the existing ports across the tree that are part of the same bridge, and the notifier already does the iteration through the switches for us. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
328621f613 |
net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register
Right now, setting up tag_8021q is a 2-step operation for a driver, first the context structure needs to be created, then the VLANs need to be installed on the ports. A similar thing is true for teardown. Merge the 2 steps into the register/unregister methods, to be as transparent as possible for the driver as to what tag_8021q does behind the scenes. This also gets rid of the funny "bool setup == true means setup, == false means teardown" API that tag_8021q used to expose. Note that dsa_tag_8021q_register() must be called at least in the .setup() driver method and never earlier (like in the driver probe function). This is because the DSA switch tree is not initialized at probe time, and the cross-chip notifiers will not work. For symmetry with .setup(), the unregister method should be put in .teardown(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
5da11eb407 |
net: dsa: make tag_8021q operations part of the core
Make tag_8021q a more central element of DSA and move the 2 driver specific operations outside of struct dsa_8021q_context (which is supposed to hold dynamic data and not really constant function pointers). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
d7b1fd520d |
net: dsa: let the core manage the tag_8021q context
The basic problem description is as follows:
Be there 3 switches in a daisy chain topology:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
The CPU will not be able to ping through the user ports of the
bottom-most switch (like for example sw2p0), simply because tag_8021q
was not coded up for this scenario - it has always assumed DSA switch
trees with a single switch.
To add support for the topology above, we must admit that the RX VLAN of
sw2p0 must be added on some ports of switches 0 and 1 as well. This is
in fact a textbook example of thing that can use the cross-chip notifier
framework that DSA has set up in switch.c.
There is only one problem: core DSA (switch.c) is not able right now to
make the connection between a struct dsa_switch *ds and a struct
dsa_8021q_context *ctx. Right now, it is drivers who call into
tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
methods.
But with cross-chip notifiers, it is possible for tag_8021q to call
drivers without drivers having ever asked for anything. A good example
is right above: when sw2p0 wants to set itself up for tag_8021q,
the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
so that they transport sw2p0's VLANs towards the CPU without dropping
them.
So instead of letting drivers manage the tag_8021q context, add a
tag_8021q_ctx pointer inside of struct dsa_switch, which will be
populated when dsa_tag_8021q_register() returns success.
The patch is fairly long-winded because we are partly reverting commit
|
||
Vladimir Oltean
|
8b6e638b4b |
net: dsa: build tag_8021q.c as part of DSA core
Upcoming patches will add tag_8021q related logic to switch.c and port.c, in order to allow it to make use of cross-chip notifiers. In addition, a struct dsa_8021q_context *ctx pointer will be added to struct dsa_switch. It seems fairly low-reward to #ifdef the *ctx from struct dsa_switch and to provide shim implementations of the entire tag_8021q.c calling surface (not even clear what to do about the tag_8021q cross-chip notifiers to avoid compiling them). The runtime overhead for switches which don't use tag_8021q is fairly small because all helpers will check for ds->tag_8021q_ctx being a NULL pointer and stop there. So let's make it part of dsa_core.o. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
cedf467064 |
net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers
In preparation of moving tag_8021q to core DSA, move all initialization and teardown related to tag_8021q which is currently done by drivers in 2 functions called "register" and "unregister". These will gather more functionality in future patches, which will better justify the chosen naming scheme. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
69ebb37064 |
net: dsa: tag_8021q: use symbolic error names
Use %pe to give the user a string holding the error code instead of just a number. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
a81a45744b |
net: dsa: tag_8021q: use "err" consistently instead of "rc"
Some of the tag_8021q code has been taken out of sja1105, which uses "rc" for its return code variables, whereas the DSA core uses "err". Change tag_8021q for consistency. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
0fac6aa098 |
net: dsa: sja1105: delete the best_effort_vlan_filtering mode
Simply put, the best-effort VLAN filtering mode relied on VLAN retagging from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to decode the source port in the tagger, but the VLAN retagging implementation inside the sja1105 chips is not the best and we were relying on marginal operating conditions. The most notable limitation of the best-effort VLAN filtering mode is its incapacity to treat this case properly: ip link add br0 type bridge vlan_filtering 1 ip link set swp2 master br0 ip link set swp4 master br0 bridge vlan del dev swp4 vid 1 bridge vlan add dev swp4 vid 1 pvid When sending an untagged packet through swp2, the expectation is for it to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1 on egress). But the switch will send it as egress-untagged. There was an attempt to fix this here: https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/ but it failed miserably because it broke PTP RX timestamping, in a way that cannot be corrected due to hardware issues related to VLAN retagging. So with either PTP broken or pushing VLAN headers on egress for untagged packets being broken, the sad reality is that the best-effort VLAN filtering code is broken. Delete it. Note that this means there will be a temporary loss of functionality in this driver until it is replaced with something better (network stack RX/TX capability for "mode 2" as described in Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware bridge" case). We simply cannot keep this code until that driver rework is done, it is super bloated and tangled with tag_8021q. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
bcb9928a15 |
net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
This was not caught because there is no switch driver which implements
the .port_bridge_join but not .port_bridge_leave method, but it should
nonetheless be fixed, as in certain conditions (driver development) it
might lead to NULL pointer dereference.
Fixes:
|
||
Vladimir Oltean
|
b71d098715 |
net: dsa: return -EOPNOTSUPP when driver does not implement .port_lag_join
The DSA core has a layered structure, and even though we end up returning 0 (success) to user space when setting a bonding/team upper that can't be offloaded, some parts of the framework actually need to know that we couldn't offload that. For example, if dsa_switch_lag_join returns 0 as it currently does, dsa_port_lag_join has no way to tell a successful offload from a software fallback, and it will call dsa_port_bridge_join afterwards. Then we'll think we're offloading the bridge master of the LAG, when in fact we're not even offloading the LAG. In turn, this will make us set skb->offload_fwd_mark = true, which is incorrect and the bridge doesn't like it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
63c51453c8 |
net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
When we join a bridge that already has some local addresses pointing to itself, we do not get those notifications. Similarly, when we leave that bridge, we do not get notifications for the deletion of those entries. The only switchdev notifications we get are those of entries added while the DSA port is enslaved to the bridge. This makes use cases such as the following work properly (with the number of additions and removals properly balanced): ip link add br0 type bridge ip link add br1 type bridge ip link set br0 address 00:01:02:03:04:05 ip link set br1 address 00:01:02:03:04:05 ip link set swp0 up ip link set swp1 up ip link set swp0 master br0 ip link set swp1 master br1 ip link set br0 up ip link set br1 up ip link del br1 # 00:01:02:03:04:05 still installed on the CPU port ip link del br0 # 00:01:02:03:04:05 finally removed from the CPU port Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
4bed397c3e |
net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
When (a) "dev" is a bridge port which the DSA switch tree offloads, but is otherwise not a dsa slave (such as a LAG netdev), or (b) "dev" is the bridge net device itself then strange things happen to the dev_hold/dev_put pair: dsa_schedule_work() will still be called with a DSA port that offloads that netdev, but dev_hold() will be called on the non-DSA netdev. Then the "if" condition in dsa_slave_switchdev_event_work() does not pass, because "dev" is not a DSA netdev, so dev_put() is not called. This results in the simple fact that we have a reference counting mismatch on the "dev" net device. This can be seen when we add support for host addresses installed on the bridge net device. ip link add br1 type bridge ip link set br1 address 00:01:02:03:04:05 ip link set swp0 master br1 ip link del br1 [ 968.512278] unregister_netdevice: waiting for br1 to become free. Usage count = 5 It seems foolish to do penny pinching and not add the net_device pointer in the dsa_switchdev_event_work structure, so let's finally do that. As an added bonus, when we start offloading local entries pointing towards the bridge, these will now properly appear as 'offloaded' in 'bridge fdb' (this was not possible before, because 'dev' was assumed to only be a DSA net device): 00:01:02:03:04:05 dev br0 vlan 1 offload master br0 permanent 00:01:02:03:04:05 dev br0 offload master br0 permanent Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
81a619f787 |
net: dsa: include fdb entries pointing to bridge in the host fdb list
The bridge supports a legacy way of adding local (non-forwarded) FDB
entries, which works on an individual port basis:
bridge fdb add dev swp0 00:01:02:03:04:05 master local
As well as a new way, added by Roopa Prabhu in commit
|
||
Tobias Waldekranz
|
10fae4ac89 |
net: dsa: include bridge addresses which are local in the host fdb list
The bridge automatically creates local (not forwarded) fdb entries pointing towards physical ports with their interface MAC addresses. For switchdev, the significance of these fdb entries is the exact opposite of that of non-local entries: instead of sending these frame outwards, we must send them inwards (towards the host). NOTE: The bridge's own MAC address is also "local". If that address is not shared with any port, the bridge's MAC is not be added by this functionality - but the following commit takes care of that case. NOTE 2: We mark these addresses as host-filtered regardless of the value of ds->assisted_learning_on_cpu_port. This is because, as opposed to the speculative logic done for dynamic address learning on foreign interfaces, the local FDB entries are rather fixed, so there isn't any risk of them migrating from one bridge port to another. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
3068d466a6 |
net: dsa: sync static FDB entries on foreign interfaces to hardware
DSA is able to install FDB entries towards the CPU port for addresses which were dynamically learnt by the software bridge on foreign interfaces that are in the same bridge with a DSA switch interface. Since this behavior is opportunistic, it is guarded by the "assisted_learning_on_cpu_port" property which can be enabled by drivers and is not done automatically (since certain switches may support address learning of packets coming from the CPU port). But if those FDB entries added on the foreign interfaces are static (added by the user) instead of dynamically learnt, currently DSA does not do anything (and arguably it should). Because static FDB entries are not supposed to move on their own, there is no downside in reusing the "assisted_learning_on_cpu_port" logic to sync static FDB entries to the DSA CPU port unconditionally, even if assisted_learning_on_cpu_port is not requested by the driver. For example, this situation: br0 / \ swp0 dummy0 $ bridge fdb add 02:00:de:ad:00:01 dev dummy0 vlan 1 master static Results in DSA adding an entry in the hardware FDB, pointing this address towards the CPU port. The same is true for entries added to the bridge itself, e.g: $ bridge fdb add 02:00:de:ad:00:01 dev br0 vlan 1 self local (except that right now, DSA still ignores 'local' FDB entries, this will be changed in a later patch) Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
26ee7b06a4 |
net: dsa: install the host MDB and FDB entries in the master's RX filter
If the DSA master implements strict address filtering, then the unicast and multicast addresses kept by the DSA CPU ports should be synchronized with the address lists of the DSA master. Note that we want the synchronization of the master's address lists even if the DSA switch doesn't support unicast/multicast database operations, on the premises that the packets will be flooded to the CPU in that case, and we should still instruct the master to receive them. This is why we do the dev_uc_add() etc first, even if dsa_port_notify() returns -EOPNOTSUPP. In turn, dev_uc_add() and friends return error only if memory allocation fails, so it is probably ok to check and propagate that error code and not just ignore it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
3f6e32f92a |
net: dsa: reference count the FDB addresses at the cross-chip notifier level
The same concerns expressed for host MDB entries are valid for host FDBs just as well: - in the case of multiple bridges spanning the same switch chip, deleting a host FDB entry that belongs to one bridge will result in breakage to the other bridge - not deleting FDB entries across DSA links means that the switch's hardware tables will eventually run out, given enough wear&tear So do the same thing and introduce reference counting for CPU ports and DSA links using the same data structures as we have for MDB entries. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
3dc80afc50 |
net: dsa: introduce a separate cross-chip notifier type for host FDBs
DSA treats some bridge FDB entries by trapping them to the CPU port. Currently, the only class of such entries are FDB addresses learnt by the software bridge on a foreign interface. However there are many more to be added: - FDB entries with the is_local flag (for termination) added by the bridge on the user ports (typically containing the MAC address of the bridge port) - FDB entries pointing towards the bridge net device (for termination). Typically these contain the MAC address of the bridge net device. - Static FDB entries installed on a foreign interface that is in the same bridge with a DSA user port. The reason why a separate cross-chip notifier for host FDBs is justified compared to normal FDBs is the same as in the case of host MDBs: the cross-chip notifier matching function in switch.c should avoid installing these entries on routing ports that route towards the targeted switch, but not towards the CPU. This is required in order to have proper support for H-like multi-chip topologies. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
161ca59d39 |
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was meant to be simplistic and just get the job done without worrying too much about dangling resources left behind. For example, somebody installs an MDB entry on sw0p0 in this daisy chain topology. It gets installed using ds->ops->port_mdb_add() on sw0p0, sw1p4 and sw2p4. | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] Then the same person deletes that MDB entry. The cross-chip notifier for deletion only matches sw0p0: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ ] Why? Because the DSA links are 'trunk' ports, if we just go ahead and delete the MDB from sw1p4 and sw2p4 directly, we might delete those multicast entries when they are still needed. Just consider the fact that somebody does: - add a multicast MAC address towards sw0p0 [ via the cross-chip notifiers it gets installed on the DSA links too ] - add the same multicast MAC address towards sw0p1 (another port of that same switch) - delete the same multicast MAC address from sw0p0. At this point, if we deleted the MAC address from the DSA links, it would be flooded, even though there is still an entry on switch 0 which needs it not to. So that is why deletions only match the targeted source port and nothing on DSA links. Of course, dangling resources means that the hardware tables will eventually run out given enough additions/removals, but hey, at least it's simple. But there is a bigger concern which needs to be addressed, and that is our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add() on the upstream port, and a similar thing happens on deletion: dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the upstream port. When there are 2 VLAN-unaware bridges spanning the same switch (which is a use case DSA proudly supports), each bridge will install its own SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the user ports enslaved to br0 and the user ports enslaved to br1. Not good. The host-trapped multicast addresses installed by br1 will be deleted when any state changes in br0 (IGMP timers expire, or ports leave, etc). To avoid this, we could of course go the route of the zero-sum game and delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better design is to just admit that on shared ports like DSA links and CPU ports, we should be reference counting calls, even if this consumes some dynamic memory which DSA has traditionally avoided. On the flip side, the hardware tables of switches are limited in size, so it would be good if the OS managed them properly instead of having them eventually overflow. To address the memory usage concern, we only apply the refcounting of MDB entries on ports that are really shared (CPU ports and DSA links) and not on user ports. In a typical single-switch setup, this means only the CPU port (and the host MDB entries are not that many, really). The name of the newly introduced data structures (dsa_mac_addr) is chosen in such a way that will be reusable for host FDB entries (next patch). With this change, we can finally have the same matching logic for the MDB additions and deletions, as well as for their host-trapped variants. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
b8e997c490 |
net: dsa: introduce a separate cross-chip notifier type for host MDBs
Commit
|
||
Vladimir Oltean
|
b117e1e8a8 |
net: dsa: delete dsa_legacy_fdb_add and dsa_legacy_fdb_del
We want to add reference counting for FDB entries in cross-chip topologies, and in order for that to have any chance of working and not be unbalanced (leading to entries which are never deleted), we need to ensure that higher layers are sane, because if they aren't, it's garbage in, garbage out. For example, if we add a bridge FDB entry twice, the bridge properly errors out: $ bridge fdb add dev swp0 00:01:02:03:04:07 master static $ bridge fdb add dev swp0 00:01:02:03:04:07 master static RTNETLINK answers: File exists However, the same thing cannot be said about the bridge bypass operations: $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ echo $? 0 But one 'bridge fdb del' is enough to remove the entry, no matter how many times it was added. The bridge bypass operations are impossible to maintain in these circumstances and lack of support for reference counting the cross-chip notifiers is holding us back from making further progress, so just drop support for them. The only way left for users to install static bridge FDB entries is the proper one, using the "master static" flags. With this change, rtnl_fdb_add() falls back to calling ndo_dflt_fdb_add() which uses the duplicate-exclusive variant of dev_uc_add(): dev_uc_add_excl(). Because DSA does not (yet) declare IFF_UNICAST_FLT, this results in us going to promiscuous mode: $ bridge fdb add dev swp0 00:01:02:03:04:05 [ 28.206743] device swp0 entered promiscuous mode $ bridge fdb add dev swp0 00:01:02:03:04:05 RTNETLINK answers: File exists So even if it does not completely fail, there is at least some indication that it is behaving differently from before, and closer to user space expectations, I would argue (the lack of a "local|static" specifier defaults to "local", or "host-only", so dev_uc_add() is a reasonable default implementation). If the generic implementation of .ndo_fdb_add provided by Vlad Yasevich is a proof of anything, it only proves that the implementation provided by DSA was always wrong, by not looking at "ndm->ndm_state & NUD_NOARP" (the "static" flag which means that the FDB entry points outwards) and "ndm->ndm_state & NUD_PERMANENT" (the "local" flag which means that the FDB entry points towards the host). It all used to mean the same thing to DSA. Update the documentation so that the users are not confused about what's going on. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
7491894532 |
net: dsa: replay a deletion of switchdev objects for ports leaving a bridged LAG
When a DSA switch port leaves a bonding interface that is under a bridge, there might be dangling switchdev objects on that port left behind, because the bridge is not aware that its lower interface (the bond) changed state in any way. Call the bridge replay helpers with adding=false before changing dp->bridge_dev to NULL, because we need to simulate to dsa_slave_port_obj_del() that these notifications were emitted by the bridge. We add this hook to the NETDEV_PRECHANGEUPPER event handler, because we are calling into switchdev (and the __switchdev_handle_port_obj_del fanout helpers expect the upper/lower adjacency lists to still be valid) and PRECHANGEUPPER is the last moment in time when they still are. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
4ede74e73b |
net: dsa: refactor the prechangeupper sanity checks into a dedicated function
We need to add more logic to the DSA NETDEV_PRECHANGEUPPER event handler, more exactly we need to request an unsync of switchdev objects. In order to fit more code, refactor the existing logic into a helper. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Vladimir Oltean
|
7e8c18586d |
net: bridge: allow the switchdev replay functions to be called for deletion
When a switchdev port leaves a LAG that is a bridge port, the switchdev objects and port attributes offloaded to that port are not removed: ip link add br0 type bridge ip link add bond0 type bond mode 802.3ad ip link set swp0 master bond0 ip link set bond0 master br0 bridge vlan add dev bond0 vid 100 ip link set swp0 nomaster VLAN 100 will remain installed on swp0 despite it going into standalone mode, because as far as the bridge is concerned, nothing ever happened to its bridge port. Let's extend the bridge vlan, fdb and mdb replay functions to take a 'bool adding' argument, and make DSA and ocelot call the replay functions with 'adding' as false from the switchdev unsync path, for the switch port that leaves the bridge. Note that this patch in itself does not salvage anything, because in the current pull mode of operation, DSA still needs to call the replay helpers with adding=false. This will be done in another patch. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |