The IPv6 path already drops dst in the daddr changed case, but the IPv4
path does not. This change makes the two code paths consistent.
Further, it is possible that there is already a metadata_dst allocated from
ingress that might already be attached to skbuff->dst while following
the bridge path. If it is not released before setting a new
metadata_dst, it will be leaked. This is similar to what is done in
bpf_set_tunnel_key() or ip6_route_input().
It is important to note that the memory being leaked is not the dst
being set in the bridge code, but rather memory allocated from some
other code path that is not being freed correctly before the skb dst is
overwritten.
An example of the leakage fixed by this commit found using kmemleak:
unreferenced object 0xffff888010112b00 (size 256):
comm "softirq", pid 0, jiffies 4294762496 (age 32.012s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 80 16 f1 83 ff ff ff ff ................
e1 4e f6 82 ff ff ff ff 00 00 00 00 00 00 00 00 .N..............
backtrace:
[<00000000d79567ea>] metadata_dst_alloc+0x1b/0xe0
[<00000000be113e13>] udp_tun_rx_dst+0x174/0x1f0
[<00000000a36848f4>] geneve_udp_encap_recv+0x350/0x7b0
[<00000000d4afb476>] udp_queue_rcv_one_skb+0x380/0x560
[<00000000ac064aea>] udp_unicast_rcv_skb+0x75/0x90
[<000000009a8ee8c5>] ip_protocol_deliver_rcu+0xd8/0x230
[<00000000ef4980bb>] ip_local_deliver_finish+0x7a/0xa0
[<00000000d7533c8c>] __netif_receive_skb_one_core+0x89/0xa0
[<00000000a879497d>] process_backlog+0x93/0x190
[<00000000e41ade9f>] __napi_poll+0x28/0x170
[<00000000b4c0906b>] net_rx_action+0x14f/0x2a0
[<00000000b20dd5d4>] __do_softirq+0xf4/0x305
[<000000003a7d7e15>] __irq_exit_rcu+0xc3/0x140
[<00000000968d39a2>] sysvec_apic_timer_interrupt+0x9e/0xc0
[<000000009e920794>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[<000000008942add0>] native_safe_halt+0x13/0x20
Florian Westphal says: "Original code was likely fine because nothing
ever did set a skb->dst entry earlier than bridge in those days."
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Harsh Modi <harshmodi@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Harshit Mogalapalli says:
In ebt_do_table() function dereferencing 'private->hook_entry[hook]'
can lead to NULL pointer dereference. [..] Kernel panic:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[..]
RIP: 0010:ebt_do_table+0x1dc/0x1ce0
Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88
[..]
Call Trace:
nf_hook_slow+0xb1/0x170
__br_forward+0x289/0x730
maybe_deliver+0x24b/0x380
br_flood+0xc6/0x390
br_dev_xmit+0xa2e/0x12c0
For some reason ebtables rejects blobs that provide entry points that are
not supported by the table, but what it should instead reject is the
opposite: blobs that DO NOT provide an entry point supported by the table.
t->valid_hooks is the bitmask of hooks (input, forward ...) that will see
packets. Providing an entry point that is not support is harmless
(never called/used), but the inverse isn't: it results in a crash
because the ebtables traverser doesn't expect a NULL blob for a location
its receiving packets for.
Instead of fixing all the individual checks, do what iptables is doing and
reject all blobs that differ from the expected hooks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
After commit b6c02ef549 ("bridge: Netlink interface fix."),
br_fill_ifinfo() started to send an empty IFLA_AF_SPEC attribute when a
bridge vlan dump is requested but an interface does not have any vlans
configured.
iproute2 ignores such an empty attribute since commit b262a9becbcb
("bridge: Fix output with empty vlan lists") but older iproute2 versions as
well as other utilities have their output changed by the cited kernel
commit, resulting in failed test cases. Regardless, emitting an empty
attribute is pointless and inefficient.
Avoid this change by canceling the attribute if no AF_SPEC data was added.
Fixes: b6c02ef549 ("bridge: Netlink interface fix.")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220725001236.95062-1-bpoirier@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Same as the existing ones, no conversions. This is just for sparse sake
only so that we no longer mix be16/u16 and be32/u32 types.
Alternative is to add __force __beX in various places, but this
seems nicer.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
9c5de246c1 ("net: sparx5: mdb add/del handle non-sparx5 devices")
fbb89d02e3 ("net: sparx5: Allow mdb entries to both CPU and ports")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When br_netfilter module is loaded, skbs may be diverted to the
ipv4/ipv6 hooks, just like as if we were routing.
Unfortunately, bridge filter hooks with priority 0 may be skipped
in this case.
Example:
1. an nftables bridge ruleset is loaded, with a prerouting
hook that has priority 0.
2. interface is added to the bridge.
3. no tcp packet is ever seen by the bridge prerouting hook.
4. flush the ruleset
5. load the bridge ruleset again.
6. tcp packets are processed as expected.
After 1) the only registered hook is the bridge prerouting hook, but its
not called yet because the bridge hasn't been brought up yet.
After 2), hook order is:
0 br_nf_pre_routing // br_netfilter internal hook
0 chain bridge f prerouting // nftables bridge ruleset
The packet is diverted to br_nf_pre_routing.
If call-iptables is off, the nftables bridge ruleset is called as expected.
But if its enabled, br_nf_hook_thresh() will skip it because it assumes
that all 0-priority hooks had been called previously in bridge context.
To avoid this, check for the br_nf_pre_routing hook itself, we need to
resume directly after it, even if this hook has a priority of 0.
Unfortunately, this still results in different packet flow.
With this fix, the eval order after in 3) is:
1. br_nf_pre_routing
2. ip(6)tables (if enabled)
3. nftables bridge
but after 5 its the much saner:
1. nftables bridge
2. br_nf_pre_routing
3. ip(6)tables (if enabled)
Unfortunately I don't see a solution here:
It would be possible to move br_nf_pre_routing to a higher priority
so that it will be called later in the pipeline, but this also impacts
ebtables evaluation order, and would still result in this very ordering
problem for all nftables-bridge hooks with the same priority as the
br_nf_pre_routing one.
Searching back through the git history I don't think this has
ever behaved in any other way, hence, no fixes-tag.
Reported-by: Radim Hrazdil <rhrazdil@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Adding mdb entries on disabled ports allows you to do setup before
accepting any traffic, avoiding any time where the port is not in the
multicast group.
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in commit 316580b69d ("u64_stats: provide u64_stats_t type")
we should use u64_stats_t and related accessors to avoid load/store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netdev reference helpers have a dev_ prefix for historic
reasons. Renaming the old helpers would be too much churn
but we can rename the tracking ones which are relatively
recent and should be the default for new code.
Rename:
dev_hold_track() -> netdev_hold()
dev_put_track() -> netdev_put()
dev_replace_track() -> netdev_ref_replace()
Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:
br1
/ \
/ \
/ \
br0.11 wlan0
|
br0
/ | \
p1 p2 p3
br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has a
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.
A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating that the switch has
dealt with forwarding the frame out ports p2 and p3 as needed. This
flag instructs the software bridge it does not need to pass the frame
back down again. However, the flag is not getting reset when the frame
is passed upwards. As a result br1 sees the flag, wrongly interprets
it, and fails to forward the frame to wlan0.
When passing a frame upwards, clear the flag. This is the Rx
equivalent of br_switchdev_frame_unmark() in br_dev_xmit().
Fixes: f1c2eddf4c ("bridge: switchdev: Use an helper to clear forward mark")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack support to .ndo_fdb_del in netdevice.h and
all related methods.
Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers should call the TSO setting helper, GSO is controllable
by user space.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_vlan_group() can return NULL and thus return value must be checked
to avoid dereferencing a NULL pointer.
Fixes: 6284c723d9 ("net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations")
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220421101247.121896-1-clement.leger@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for fdb flush filtering based on destination ifindex and
vlan id. The ifindex must either match a port's device ifindex or the
bridge's. The vlan support is trivial since it's already validated by
rtnl_fdb_del, we just need to fill it in.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for fdb flush filtering based on ndm flags and state. NDM
state and flags are mapped to bridge-specific flags and matched
according to the specified masks. NTF_USE is used to represent
added_by_user flag since it sets it on fdb add and we don't have a 1:1
mapping for it. Only allowed bits can be set, NTF_SELF and NTF_MASTER are
ignored.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to specify exactly which fdbs to be flushed. They are
described by a new structure - net_bridge_fdb_flush_desc. Currently it
can match on port/bridge ifindex, vlan id and fdb flags. It is used to
describe the existing dynamic fdb flush operation. Note that this flush
operation doesn't treat permanent entries in a special way (fdb_delete vs
fdb_delete_local), it will delete them regardless if any port is using
them, so currently it can't directly replace deletes which need to handle
that case, although we can extend it later for that too.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a minimal ndo_fdb_del_bulk implementation which flushes all entries.
Support for more fine-grained filtering will be added in the following
patches.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch expands on the earlier work on layer-2 mdb entries by adding
support for host entries. Due to the fact that host joined entries do
not have any flag field, we infer the permanent flag when reporting the
entries to userspace, which otherwise would be listed as 'temp'.
Before patch:
~# bridge mdb add dev br0 port br0 grp 01:00:00:c0:ff:ee permanent
Error: bridge: Flags are not allowed for host groups.
~# bridge mdb add dev br0 port br0 grp 01:00:00:c0:ff:ee
Error: bridge: Only permanent L2 entries allowed.
After patch:
~# bridge mdb add dev br0 port br0 grp 01:00:00:c0:ff:ee permanent
~# bridge mdb show
dev br0 port br0 grp 01:00:00:c0:ff:ee permanent vid 1
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add BR_HAIRPIN_MODE, BR_ISOLATED and BR_MULTICAST_TO_UNICAST port flags to
BR_PORT_FLAGS_HW_OFFLOAD so that switchdev drivers which have an offloaded
data plane have a chance to reject these bridge port flags if they don't
support them yet.
It makes the code path go through the
SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS driver handlers, which return
-EINVAL for everything they don't recognize.
For drivers that don't catch SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS at
all, switchdev will return -EOPNOTSUPP for those which is then ignored, but
those are in the minority.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20220410134227.18810-1-arinc.unal@arinc9.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure that no bridge masters are ever considered for MST info
dumping. MST states are only supported on bridge ports, not bridge
masters - which br_mst_info_size relies on.
Fixes: 122c29486e ("net: bridge: mst: Support setting and reporting MST port states")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Link: https://lore.kernel.org/r/20220322133001.16181-1-tobias@waldekranz.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
its enough to export the meta get reduce helper and then call it
from nft_meta_bridge too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Output of expressions might be larger than one single register, this might
clobber existing data. Reset tracking for all destination registers that
required to store the expression output.
This patch adds three new helper functions:
- nft_reg_track_update: cancel previous register tracking and update it.
- nft_reg_track_cancel: cancel any previous register tracking info.
- __nft_reg_track_cancel: cancel only one single register tracking info.
Partial register clobbering detection is also supported by checking the
.num_reg field which describes the number of register that are used.
This patch updates the following expressions:
- meta_bridge
- bitwise
- byteorder
- meta
- payload
to use these helper functions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Skip register tracking for expressions that perform read-only operations
on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to
avoid defining stubs for these expressions.
This patch re-enables register tracking which was disabled in ed5f85d422
("netfilter: nf_tables: disable register tracking"). Follow up patches
add remaining register tracking for existing expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is useful for switchdev drivers who are offloading MST states
into hardware. As an example, a driver may wish to flush the FDB for a
port when it transitions from forwarding to blocking - which means
that the previous state must be discoverable.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is useful for switchdev drivers that might want to refuse to join
a bridge where MST is enabled, if the hardware can't support it.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
br_mst_get_info answers the question: "On this bridge, which VIDs are
mapped to the given MSTI?"
This is useful in switchdev drivers, which might have to fan-out
operations, relating to an MSTI, per VLAN.
An example: When a port's MST state changes from forwarding to
blocking, a driver may choose to flush the dynamic FDB entries on that
port to get faster reconvergence of the network, but this should only
be done in the VLANs that are managed by the MSTI in question.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Generate a switchdev notification whenever an MST state changes. This
notification is keyed by the VLANs MSTI rather than the VID, since
multiple VLANs may share the same MST instance.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever a VLAN moves to a new MSTI, send a switchdev notification so
that switchdevs can track a bridge's VID to MSTI mappings.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trigger a switchdev event whenever the bridge's MST mode is
enabled/disabled. This allows constituent ports to either perform any
required hardware config, or refuse the change if it not supported.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make it possible to change the port state in a given MSTI by extending
the bridge port netlink interface (RTM_SETLINK on PF_BRIDGE).The
proposed iproute2 interface would be:
bridge mst set dev <PORT> msti <MSTI> state <STATE>
Current states in all applicable MSTIs can also be dumped via a
corresponding RTM_GETLINK. The proposed iproute interface looks like
this:
$ bridge mst
port msti
vb1 0
state forwarding
100
state disabled
vb2 0
state forwarding
100
state forwarding
The preexisting per-VLAN states are still valid in the MST
mode (although they are read-only), and can be queried as usual if one
is interested in knowing a particular VLAN's state without having to
care about the VID to MSTI mapping (in this example VLAN 20 and 30 are
bound to MSTI 100):
$ bridge -d vlan
port vlan-id
vb1 10
state forwarding mcast_router 1
20
state disabled mcast_router 1
30
state disabled mcast_router 1
40
state forwarding mcast_router 1
vb2 10
state forwarding mcast_router 1
20
state forwarding mcast_router 1
30
state forwarding mcast_router 1
40
state forwarding mcast_router 1
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow a VLAN to move out of the CST (MSTI 0), to an independent tree.
The user manages the VID to MSTI mappings via a global VLAN
setting. The proposed iproute2 interface would be:
bridge vlan global set dev br0 vid <VID> msti <MSTI>
Changing the state in non-zero MSTIs is still not supported, but will
be addressed in upcoming changes.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow the user to switch from the current per-VLAN STP mode to an MST
mode.
Up to this point, per-VLAN STP states where always isolated from each
other. This is in contrast to the MSTP standard (802.1Q-2018, Clause
13.5), where VLANs are grouped into MST instances (MSTIs), and the
state is managed on a per-MSTI level, rather that at the per-VLAN
level.
Perhaps due to the prevalence of the standard, many switching ASICs
are built after the same model. Therefore, add a corresponding MST
mode to the bridge, which we can later add offloading support for in a
straight-forward way.
For now, all VLANs are fixed to MSTI 0, also called the Common
Spanning Tree (CST). That is, all VLANs will follow the port-global
state.
Upcoming changes will make this actually useful by allowing VLANs to
be mapped to arbitrary MSTIs and allow individual MSTI states to be
changed.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Revert CHECKSUM_UNNECESSARY for UDP packet from conntrack.
2) Reject unsupported families when creating tables, from Phil Sutter.
3) GRE support for the flowtable, from Toshiaki Makita.
4) Add GRE offload support for act_ct, also from Toshiaki.
5) Update mlx5 driver to support for GRE flowtable offload,
from Toshiaki Makita.
6) Oneliner to clean up incorrect indentation in nf_conntrack_bridge,
from Jiapeng Chong.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: bridge: clean up some inconsistent indenting
net/mlx5: Support GRE conntrack offload
act_ct: Support GRE offload
netfilter: flowtable: Support GRE
netfilter: nf_tables: Reject tables of unsupported family
Revert "netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY"
====================
Link: https://lore.kernel.org/r/20220315091513.66544-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit
baebdf48c3 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
If skb->tstamp has the mono delivery_time, clearing it can hurt
the performance when it finally transmits out to fq@phy-dev.
The earlier patch added a skb->mono_delivery_time bit to
flag the skb->tstamp carrying the mono delivery_time.
This patch adds skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.
The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally. It will be done in a latter patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).
Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.
Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.
While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev. For example, when forwarding from egress to
ingress and then finally back to egress:
tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
^ ^
reset rest
This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq. A latter patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.
In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
[ This patch is a prep work. The following patches will
get the other parts of the stack ready first. Then another patch
after that will finally set the skb->mono_delivery_time. ]
skb_set_delivery_time() function is added. It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.
A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place
holder. It will be enabled in a latter patch.
A similar case in tcp_ipv6 can be done with
skb_set_delivery_time() in tcp_v6_send_response().
[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various switchcores support setting ports in locked mode, so that
clients behind locked ports cannot send traffic through the port
unless a fdb entry is added with the clients MAC address.
Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a 802.1X scenario, clients connected to a bridge port shall not
be allowed to have traffic forwarded until fully authenticated.
A static fdb entry of the clients MAC address for the bridge port
unlocks the client and allows bidirectional communication.
This scenario is facilitated with setting the bridge port in locked
mode, which is also supported by various switchcore chipsets.
Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cleanup_net() is competing with other rtnl users.
Instead of calling br_net_exit() for each netns,
call br_net_exit_batch() once.
This gives cleanup_net() ability to group more devices
and call unregister_netdevice_many() only once for all bridge devices.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever bridge driver hits the max capacity of MDBs, it disables
the MC processing (by setting corresponding bridge option), but never
notifies switchdev about such change (the notifiers are called only upon
explicit setting of this option, through the registered netlink interface).
This could lead to situation when Software MDB processing gets disabled,
but this event never gets offloaded to the underlying Hardware.
Fix this by adding a notify message in such case.
Fixes: 147c1e9b90 ("switchdev: bridge: Offload multicast disabled")
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20220215165303.31908-1-oleksandr.mazur@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The major user of replayed switchdev objects is DSA, and so far it
hasn't needed information about anything other than bridge port VLANs,
so this is all that br_switchdev_vlan_replay() knows to handle.
DSA has managed to get by through replicating every VLAN addition on a
user port such that the same VLAN is also added on all DSA and CPU
ports, but there is a corner case where this does not work.
The mv88e6xxx DSA driver currently prints this error message as soon as
the first port of a switch joins a bridge:
mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95
where a6:ef:77:c8:5f:3d vid 1 is a local FDB entry corresponding to the
bridge MAC address in the default_pvid.
The -EOPNOTSUPP is returned by mv88e6xxx_port_db_load_purge() because it
tries to map VID 1 to a FID (the ATU is indexed by FID not VID), but
fails to do so. This is because ->port_fdb_add() is called before
->port_vlan_add() for VID 1.
The abridged timeline of the calls is:
br_add_if
-> netdev_master_upper_dev_link
-> dsa_port_bridge_join
-> switchdev_bridge_port_offload
-> br_switchdev_vlan_replay (*)
-> br_switchdev_fdb_replay
-> mv88e6xxx_port_fdb_add
-> nbp_vlan_init
-> nbp_vlan_add
-> mv88e6xxx_port_vlan_add
and the issue is that at the time of (*), the bridge port isn't in VID 1
(nbp_vlan_init hasn't been called), therefore br_switchdev_vlan_replay()
won't have anything to replay, therefore VID 1 won't be in the VTU by
the time mv88e6xxx_port_fdb_add() is called.
This happens only when the first port of a switch joins. For further
ports, the initial mv88e6xxx_port_vlan_add() is sufficient for VID 1 to
be loaded in the VTU (which is switch-wide, not per port).
The problem is somewhat unique to mv88e6xxx by chance, because most
other drivers offload an FDB entry by VID, so FDBs and VLANs can be
added asynchronously with respect to each other, but addressing the
issue at the bridge layer makes sense, since what mv88e6xxx requires
isn't absurd.
To fix this problem, we need to recognize that it isn't the VLAN group
of the port that we're interested in, but the VLAN group of the bridge
itself (so it isn't a timing issue, but rather insufficient information
being passed from switchdev to drivers).
As mentioned, currently nbp_switchdev_sync_objs() only calls
br_switchdev_vlan_replay() for VLANs corresponding to the port, but the
VLANs corresponding to the bridge itself, for local termination, also
need to be replayed. In this case, VID 1 is not (yet) present in the
port's VLAN group but is present in the bridge's VLAN group.
So to fix this bug, DSA is now obligated to explicitly handle VLANs
pointing towards the bridge in order to "close this race" (which isn't
really a race). As Tobias Waldekranz notices, this also implies that it
must explicitly handle port VLANs on foreign interfaces, something that
worked implicitly before:
https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260
So in the end, br_switchdev_vlan_replay() must replay all VLANs from all
VLAN groups: all the ports, and the bridge itself.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There may be switchdev drivers that can add/remove a FDB or MDB entry
only as long as the VLAN it's in has been notified and offloaded first.
The nbp_switchdev_sync_objs() method satisfies this requirement on
addition, but nbp_switchdev_unsync_objs() first deletes VLANs, then
deletes MDBs and FDBs. Reverse the order of the function calls to cater
to this requirement.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD
event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases:
- a struct net_bridge_vlan got created
- an existing struct net_bridge_vlan was modified
This makes it impossible for switchdev drivers to properly balance
PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to
happen, we must provide a way for drivers to distinguish between a
VLAN with changed flags and a new one.
Annotate struct switchdev_obj_port_vlan with a "bool changed" that
distinguishes the 2 cases above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when a VLAN entry is added multiple times in a row to a
bridge port, nbp_vlan_add() calls br_switchdev_port_vlan_add() each
time, even if the VLAN already exists and nothing about it has changed:
bridge vlan add dev lan12 vid 100 master static
Similarly, when a VLAN is added multiple times in a row to a bridge,
br_vlan_add_existing() doesn't filter at all the calls to
br_switchdev_port_vlan_add():
bridge vlan add dev br0 vid 100 self
This behavior makes driver-level accounting of VLANs impossible, since
it is enough for a single deletion event to remove a VLAN, but the
addition event can be emitted an unlimited number of times.
The cause for this can be identified as follows: we rely on
__vlan_add_flags() to retroactively tell us whether it has changed
anything about the VLAN flags or VLAN group pvid. So we'd first have to
call __vlan_add_flags() before calling br_switchdev_port_vlan_add(), in
order to have access to the "bool *changed" information. But we don't
want to change the event ordering, because we'd have to revert the
struct net_bridge_vlan changes we've made if switchdev returns an error.
So to solve this, we need another function that tells us whether any
change is going to occur in the VLAN or VLAN group, _prior_ to calling
__vlan_add_flags().
Split __vlan_add_flags() into a precommit and a commit stage, and rename
it to __vlan_flags_update(). The precommit stage,
__vlan_flags_would_change(), will determine whether there is any reason
to notify switchdev due to a change of flags (note: the BRENTRY flag
transition from false to true is treated separately: as a new switchdev
entry, because we skipped notifying the master VLAN when it wasn't a
brentry yet, and therefore not as a change of flags).
With this lookahead/precommit function in place, we can avoid notifying
switchdev if nothing changed for the VLAN and VLAN group.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>