Even though ENETC interfaces are exposed as individual PCIe PFs with
their own driver instances, the ENETC is still fundamentally a
multi-port Ethernet controller, and some parts of the IP take a port
number (as can be seen in the PSFP implementation).
Create a common helper that can be used outside of the TSN code for
retrieving the ENETC port number based on the PF number. This is only
correct for LS1028A, the only Linux-capable instantiation of ENETC thus
far.
Note that ENETC port 3 is PF 6. The TSN code did not care about this
because ENETC port 3 does not support TSN, so the wrong mapping done by
enetc_get_port for PF 6 could have never been hit.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Described in fd5736bf9f ("enetc: Workaround for MDIO register access
issue") is a workaround for a hardware bug that requires a register
access of the MDIO controller to never happen concurrently with a
register access of a port PF. To avoid that, a mutual exclusion scheme
with rwlocks was implemented - the port PF accessors are the 'read'
side, and the MDIO accessors are the 'write' side.
When we do XDP_REDIRECT between two ENETC interfaces, all is fine
because the MDIO lock is already taken from the NAPI poll loop.
But when the ingress interface is not ENETC, just the egress is, the
MDIO lock is not taken, so we might access the port PF registers
concurrently with MDIO, which will make the link flap due to wrong
values returned from the PHY.
To avoid this, let's just slap an enetc_lock_mdio/enetc_unlock_mdio at
the beginning and ending of enetc_xdp_xmit. The fact that the MDIO lock
is designed as a rwlock is important here, because the read side is
reentrant (that is one of the main reasons why we chose it). Usually,
the way we benefit of its reentrancy is by running the data path
concurrently on both CPUs, but in this case, we benefit from the
reentrancy by taking the lock even when the lock is already taken
(and that's the situation where ENETC is both the ingress and the egress
interface for XDP_REDIRECT, which was fine before and still is fine now).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the TX ring is congested, enetc_xdp_tx() returns false for the
current XDP frame (represented as an array of software BDs).
This array of software TX BDs is constructed in enetc_rx_swbd_to_xdp_tx_swbd
from software BDs freshly cleaned from the RX ring. The issue is that we
scrub the RX software BDs too soon, more precisely before we know that
we can enqueue the TX BDs successfully into the TX ring.
If we can't enqueue them (and enetc_xdp_tx returns false), we call
enetc_xdp_drop which attempts to recycle the buffers held by the RX
software BDs. But because we scrubbed those RX BDs already, two things
happen:
(a) we leak their memory
(b) we populate the RX software BD ring with an all-zero rx_swbd
structure, which makes the buffer refill path allocate more memory.
enetc_refill_rx_ring
-> if (unlikely(!rx_swbd->page))
-> enetc_new_page
That is a recipe for fast OOM.
Fixes: 7ed2bc8007 ("net: enetc: add support for XDP_TX")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the XDP program returns an invalid action, we should free the RX
buffer.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for one CPU to perform TX hashing (see netdev_pick_tx)
between the 8 ENETC TX rings, and the TX hashing to select TX queue 1.
At the same time, it is possible for the other CPU to already use TX
ring 1 for XDP (either XDP_TX or XDP_REDIRECT). Since there is no mutual
exclusion between XDP and the network stack, we run into an issue
because the ENETC TX procedure is not reentrant.
The obvious approach would be to just make XDP take the lock of the
network stack's TX queue corresponding to the ring it's about to enqueue
in.
For XDP_REDIRECT, this is quite straightforward, a lock at the beginning
and end of enetc_xdp_xmit() should do the trick.
But for XDP_TX, it's a bit more complicated. For one, we do TX batching
all by ourselves for frames with the XDP_TX verdict. This is something
we would like to keep the way it is, for performance reasons. But
batching means that the network stack's lock should be kept from the
first enqueued XDP_TX frame and until we ring the doorbell. That is
mostly fine, except for cases when in the same NAPI loop we have mixed
XDP_TX and XDP_REDIRECT frames. So if enetc_xdp_xmit() gets called while
we are holding the lock from the RX NAPI, then bam, deadlock. The naive
answer could be 'just flush the XDP_TX frames first, then release the
network stack's TX queue lock, then call xdp_do_flush_map()'. But even
xdp_do_redirect() is capable of flushing the batched XDP_REDIRECT
frames, so unless we unlock/relock the TX queue around xdp_do_redirect(),
there simply isn't any clean way to protect XDP_TX from concurrent
network stack .ndo_start_xmit() on another CPU.
So we need to take a different approach, and that is to reserve two
rings for the sole use of XDP. We leave TX rings
0..ndev->real_num_tx_queues-1 to be handled by the network stack, and we
pick them from the end of the priv->tx_ring array.
We make an effort to keep the mapping done by enetc_alloc_msix() which
decides which CPU handles the TX completions of which TX ring in its
NAPI poll. So the XDP TX ring of CPU 0 is handled by TX ring 6, and the
XDP TX ring of CPU 1 is handled by TX ring 7.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that commit d6a2829e82 ("net: enetc: increase RX ring default
size") has increased the RX ring size, it is quite easy to congest the
TX rings when the traffic is predominantly XDP_TX, as the RX ring is
quite a bit larger than the TX one.
Since we bit the bullet and did the expensive thing already (larger RX
rings consume more memory pages), it seems quite foolish to keep the TX
rings small. So make them equally sized with TX.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xdp_do_redirect already contains:
-> dev_map_enqueue
-> __xdp_enqueue
-> bq_enqueue
-> bq_xmit_all // if we have more than 16 frames
So the logic from enetc will never be hit, because ENETC_DEFAULT_TX_WORK
is 128.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the code path below fails:
enetc_clean_rx_ring_xdp // XDP_PASS
-> enetc_build_skb
-> enetc_map_rx_buff_to_skb
-> build_skb
enetc_clean_rx_ring_xdp will 'break', but that 'break' instruction isn't
strong enough to actually break the NAPI poll loop, just the switch/case
statement for XDP actions. So we increment rx_frm_cnt and go to the next
frames minding our own business.
Instead let's do what the skb NAPI poll function does, and break the
loop now, waiting for the memory pressure to go away. Otherwise the next
calls to build_skb() are likely to fail too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving a frame with errors, currently we do nothing with it (we
don't construct an skb or an xdp_buff), we just exit the NAPI poll loop.
Let's put the buffer back into the RX ring (similar to XDP_DROP).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
enetc_put_xdp_buff has nothing to do with XDP, frankly, it is just a
helper to populate the recycle end of the shadow RX BD ring
(next_to_alloc) with a given buffer.
On the other hand, enetc_put_rx_buff plays more tricks than its name
would suggest.
So let's rename enetc_put_rx_buff into enetc_flip_rx_buff to reflect the
half-page buffer reuse tricks that it employs, and enetc_put_xdp_buff
into enetc_put_rx_buff which suggests a more garden-variety operation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Later in enetc_clean_tx_ring we have:
/* Scrub the swbd here so we don't have to do that
* when we reuse it during xmit
*/
memset(tx_swbd, 0, sizeof(*tx_swbd));
So these assignments are unnecessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert system_wq queue_work() to schedule_work() which is
a wrapper around it, since the former is a rare construct.
Fixes: 7294380c52 ("enetc: support PTP Sync packet one-step timestamping")
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, the bootloader will already initialize the MAC address
registers of the ENETC and the driver will just use them or generate a
random one, if it is not initialized.
Add a new way to provide the MAC address: via device tree. Besides the
usual 'mac-address' property, there is also the possibility to fetch it
via a NVMEM provider. The sl28 board stores the MAC address in the SPI
NOR flash OTP region. Having this will allow linux to fetch the MAC
address from there without being dependent on the bootloader.
No in-tree boards have the device tree properties set, thus for these,
this is a no-op.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add support for PTP Sync packet one-step timestamping.
Since ENETC single-step register has to be configured dynamically per
packet for correctionField offeset and UDP checksum update, current
one-step timestamping packet has to be sent only when the last one
completes transmitting on hardware. So, on the TX, this patch handles
one-step timestamping packet as below:
- Trasmit packet immediately if no other one in transfer, or queue to
skb queue if there is already one in transfer.
The test_and_set_bit_lock() is used here to lock and check state.
- Start a work when complete transfer on hardware, to release the bit
lock and to send one skb in skb queue if has.
And the configuration for one-step timestamping on ENETC before
transmitting is,
- Set one-step timestamping flag in extension BD.
- Write 30 bits current timestamp in tstamp field of extension BD.
- Update PTP Sync packet originTimestamp field with current timestamp.
- Configure single-step register for correctionField offeset and UDP
checksum update.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark TX timestamp type per skb on skb->cb[0], instead of
global variable for all skbs. This is a preparation for
one step timestamp support.
For one-step timestamping enablement, there will be both
one-step and two-step PTP messages to transfer. And a skb
queue is needed for one-step PTP messages making sure
start to send current message only after the last one
completed on hardware. (ENETC single-step register has to
be dynamically configured per message.) So, marking TX
timestamp type per skb is required.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if the current mapping is correct for the 1 CPU and 2 CPU cases
(currently enetc is included in SoCs with up to 2 CPUs only), better
use a generic rule for the mapping to cover all possible cases.
The number of CPUs is the same as the number of interrupt vectors:
Per device Tx rings -
device_tx_ring[idx], where idx = 0..n_rings_total-1
Per interrupt vector Tx rings -
int_vector[i].ring[j], where i = 0..n_int_vects-1
j = 0..n_rings_per_v-1
Mapping rule -
n_rings_per_v = n_rings_total / n_int_vects
for i = 0..n_int_vects - 1:
for j = 0..n_rings_per_v - 1:
idx = n_int_vects * j + i
int_vector[i].ring[j] <- device_tx_ring[idx]
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210409071613.28912-1-claudiu.manoil@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The blamed commit introduced a bit in the TX software buffer descriptor
structure for determining whether a BD is final or not; we rearm the TX
interrupt vector for every frame (hence final BD) transmitted.
But there is a problem with the patch: it replaced a condition whose
expression is a bool which was evaluated at the beginning of the "while"
loop with a bool expression that is evaluated on the spot: tx_swbd->is_eof.
The problem with the latter expression is that the tx_swbd has already
been incremented at that stage, so the tx_swbd->is_eof check is in fact
with the _next_ software BD. Which is _not_ final.
The effect is that the CPU is in 100% load with ksoftirqd because it
does not acknowledge the TX interrupt, so the handler keeps getting
called again and again.
The fix is to restore the code structure, and keep the local bool is_eof
variable, just to assign it the tx_swbd->is_eof value instead of
!!tx_swbd->skb.
Fixes: d504498d2e ("net: enetc: add a dedicated is_eof bit in the TX software BD")
Reported-by: Alex Marginean <alexandru.marginean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://lore.kernel.org/r/20210409192759.3895104-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This loop will try to unmap enetc_unmap_tx_buff[-1] and crash.
Fixes: 9d2b68cc10 ("net: enetc: add support for XDP_REDIRECT")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/YHBHfCY/yv3EnM9z@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver implementation of the XDP_REDIRECT action reuses parts from
XDP_TX, most notably the enetc_xdp_tx function which transmits an array
of TX software BDs. Only this time, the buffers don't have DMA mappings,
we need to create them.
When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can
employ the same buffer reuse strategy as for the normal processing path
and for XDP_PASS: we can flip to the other page half and seed that to
the RX ring.
Note that scatter/gather support is there, but disabled due to lack of
multi-buffer support in XDP (which is added by this series):
https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in the XDP_TX patch, when receiving a burst of frames with
the XDP_TX verdict, there is a momentary dip in the number of available
RX buffers. The system will eventually recover as TX completions will
start kicking in and refilling our RX BD ring again. But until that
happens, we need to survive with as few out-of-buffer discards as
possible.
This increases the memory footprint of the driver in order to avoid
discards at 2.5Gbps line rate 64B packet sizes, the maximum speed
available for testing on 1 port on NXP LS1028A.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For reflecting packets back into the interface they came from, we create
an array of TX software BDs derived from the RX software BDs. Therefore,
we need to extend the TX software BD structure to contain most of the
stuff that's already present in the RX software BD structure, for
reasons that will become evident in a moment.
For a frame with the XDP_TX verdict, we don't reuse any buffer right
away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
page half, same as the skb code path).
Because the buffer transfers ownership from the RX ring to the TX ring,
reusing any page half right away is very dangerous. So what we can do is
we can recycle the same page half as soon as TX is complete.
The code path is:
enetc_poll
-> enetc_clean_rx_ring_xdp
-> enetc_xdp_tx
-> enetc_refill_rx_ring
(time passes, another MSI interrupt is raised)
enetc_poll
-> enetc_clean_tx_ring
-> enetc_recycle_xdp_tx_buff
But that creates a problem, because there is a potentially large time
window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, period in
which we'll have less and less RX buffers.
Basically, when the ship starts sinking, the knee-jerk reaction is to
let enetc_refill_rx_ring do what it does for the standard skb code path
(refill every 16 consumed buffers), but that turns out to be very
inefficient. The problem is that we have no rx_swbd->page at our
disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
have to call enetc_new_page for every buffer that we refill (if we
choose to refill at this early stage). Very inefficient, it only makes
the problem worse, because page allocation is an expensive process, and
CPU time is exactly what we're lacking.
Additionally, there is an even bigger problem: if we let
enetc_refill_rx_ring top up the ring's buffers again from the RX path,
remember that the buffers sent to transmission haven't disappeared
anywhere. They will be eventually sent, and processed in
enetc_clean_tx_ring, and an attempt will be made to recycle them.
But surprise, the RX ring is already full of new buffers, because we
were premature in deciding that we should refill. So not only we took
the expensive decision of allocating new pages, but now we must throw
away perfectly good and reusable buffers.
So what we do is we implement an elastic refill mechanism, which keeps
track of the number of in-flight XDP_TX buffer descriptors. We top up
the RX ring only up to the total ring capacity minus the number of BDs
that are in flight (because we know that those BDs will return to us
eventually).
The enetc driver manages 1 RX ring per CPU, and the default TX ring
management is the same. So we do XDP_TX towards the TX ring of the same
index, because it is affined to the same CPU. This will probably not
produce great results when we have a tc-taprio/tc-mqprio qdisc on the
interface, because in that case, the number of TX rings might be
greater, but I didn't add any checks for that yet (mostly because I
didn't know what checks to add).
It should also be noted that we need to change the DMA mapping direction
for RX buffers, since they may now be reflected into the TX ring of the
same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
remapping as DMA_TO_DEVICE, because performance is better this way.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the RX ring, enetc uses an allocation scheme based on pages split
into two buffers, which is already very efficient in terms of preventing
reallocations / maximizing reuse, so I see no reason why I would change
that.
+--------+--------+--------+--------+--------+--------+--------+
| | | | | | | |
| half B | half B | half B | half B | half B | half B | half B |
| | | | | | | |
+--------+--------+--------+--------+--------+--------+--------+
| | | | | | | |
| half A | half A | half A | half A | half A | half A | half A | RX ring
| | | | | | | |
+--------+--------+--------+--------+--------+--------+--------+
^ ^
| |
next_to_clean next_to_alloc
next_to_use
+--------+--------+--------+--------+--------+
| | | | | |
| half B | half B | half B | half B | half B |
| | | | | |
+--------+--------+--------+--------+--------+--------+--------+
| | | | | | | |
| half B | half B | half A | half A | half A | half A | half A | RX ring
| | | | | | | |
+--------+--------+--------+--------+--------+--------+--------+
| | | ^ ^
| half A | half A | | |
| | | next_to_clean next_to_use
+--------+--------+
^
|
next_to_alloc
then when enetc_refill_rx_ring is called, whose purpose is to advance
next_to_use, it sees that it can take buffers up to next_to_alloc, and
it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
one!".
The only problem is that for default PAGE_SIZE values of 4096, buffer
sizes are 2048 bytes. While this is enough for normal skb allocations at
an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
bytes, and including skb_shared_info and alignment, we end up being able
to make use of only 1472 bytes, which is insufficient for the default
MTU.
To solve that problem, we implement scatter/gather processing in the
driver, because we would really like to keep the existing allocation
scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
another one of 28 bytes.
Because the headroom required by XDP is different (and much larger) than
the one required by the network stack, whenever a BPF program is added
or deleted on the port, we drain the existing RX buffers and seed new
ones with the required headroom. We also keep the required headroom in
rx_ring->buffer_offset.
The simplest way to implement XDP_PASS, where an skb must be created, is
to create an xdp_buff based on the next_to_clean RX BDs, but not clear
those BDs from the RX ring yet, just keep the original index at which
the BDs for this frame started. Then, if the verdict is XDP_PASS,
instead of converting the xdb_buff to an skb, we replay a call to
enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
starting from the original BD index.
We would also like to be minimally invasive to the regular RX data path,
and not check whether there is a BPF program attached to the ring on
every packet. So we create a separate RX ring processing function for
XDP.
Because we only install/remove the BPF program while the interface is
down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
shouldn't be any circumstance in which we are processing packets and
there is a potentially freed BPF program attached to the RX ring.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
so we need to avoid a forward declaration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the future introduction of some new fields into enetc_tx_swbd such
as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
the old code paths.
This is because TX software buffer descriptors are kept in a ring that
is shadow of the hardware TX ring, so these structures keep getting
reused, and there is always the possibility that when a software BD is
reused (after we ran a full circle through the TX ring), the old user of
the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
skb, which would need to set is_xdp_tx = false.
To be minimally invasive to the old code paths, let's just scrub the
software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
we know that nobody uses this software TX BD (tx_ring->next_to_clean
hasn't yet been updated, and the TX paths check enetc_bd_unused which
tells them if there's any more space in the TX ring for a new enqueue).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the transmit path, if we have a scatter/gather frame, it is put into
multiple software buffer descriptors, the last of which has the skb
pointer populated (which is necessary for rearming the TX MSI vector and
for collecting the two-step TX timestamp from the TX confirmation path).
At the moment, this is sufficient, but with XDP_TX, we'll need to
service TX software buffer descriptors that don't have an skb pointer,
however they might be final nonetheless. So add a dedicated bit for
final software BDs that we populate and check explicitly. Also, we keep
looking just for an skb when doing TX timestamping, because we don't
want/need that for XDP.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to build an skb from two code paths now: from the plain RX data
path and from the XDP data path when the verdict is XDP_PASS.
Create a new enetc_build_skb function which contains the essential steps
for building an skb based on the first and last positions of buffer
descriptors within the RX ring.
We also squash the enetc_process_skb function into enetc_build_skb,
because what that function did wasn't very meaningful on its own.
The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
itself must be before napi_gro_receive, because that's when we lose
ownership of the skb.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can and should check the RX BD errors before starting to build the
skb. The only apparent reason why things are done in this backwards
order is to spare one call to enetc_rxbd_next.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enetc runs out of exact match entries for unicast address
filtering, it switches to an approach based on hash tables, where
multiple MAC addresses might end up in the same bucket.
However, the enetc_set_mac_ht_flt function currently depends on the
system endianness, because it interprets the 64-bit hash value as an
array of two u32 elements. Modify this to use lower_32_bits and
upper_32_bits.
Tested by forcing enetc to go into hash table mode by creating two
macvlan upper interfaces:
ip link add link eno0 address 00:01:02:03:00:00 eno0.0 type macvlan && ip link set eno0.0 up
ip link add link eno0 address 00:01:02:03:00:01 eno0.1 type macvlan && ip link set eno0.1 up
and verified that the same bit values are written to the registers
before and after:
enetc_sync_mac_filters: addr 00:00:80:00:40:10 exact match 0
enetc_sync_mac_filters: addr 00:00:00:00:80:00 exact match 0
enetc_set_mac_ht_flt: hash 0x80008000000000 UMHFR0 0x0 UMHFR1 0x800080
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ENETC has a 64-entry hash table for VLAN RX filtering per Station
Interface, which is accessed through two 32-bit registers: VHFR0 holding
the low portion, and VHFR1 holding the high portion.
The enetc_set_vlan_ht_filter function looks at the pf->vlan_ht_filter
bitmap, which is fundamentally an unsigned long variable, and casts it
to a u32 array of two elements. It puts the first u32 element into VHFR0
and the second u32 element into VHFR1.
It is easy to imagine that this will not work on big endian systems
(although, yes, we have bigger problems, because currently enetc assumes
that the CPU endianness is equal to the controller endianness, aka
little endian - but let's assume that we could add a cpu_to_le32 in
enetc_wd_reg and a le32_to_cpu in enetc_rd_reg).
Let's use lower_32_bits and upper_32_bits which are designed to work
regardless of endianness.
Tested that both the old and the new method produce the same results:
$ ethtool -K eth1 rx-vlan-filter on
$ ip link add link eth1 name eth1.100 type vlan id 100
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x20
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x20
$ ip link add link eth1 name eth1.101 type vlan id 101
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x30
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x30
$ ip link add link eth1 name eth1.34 type vlan id 34
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x34
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x34
$ ip link add link eth1 name eth1.1024 type vlan id 1024
enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x1 VHFR1 0x34
enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x1 VHFR1 0x34
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael reports that after the blamed patch, unbinding a VF would cause
these transactions to remain pending, and trigger some warnings with the
DMA API debug:
$ echo 1 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs
pci 0000:00:01.0: [1957:ef00] type 00 class 0x020001
fsl_enetc_vf 0000:00:01.0: Adding to iommu group 19
fsl_enetc_vf 0000:00:01.0: enabling device (0000 -> 0002)
fsl_enetc_vf 0000:00:01.0 eno0vf0: renamed from eth0
$ echo 0 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs
DMA-API: pci 0000:00:01.0: device driver has pending DMA allocations while released from device [count=1]
One of leaked entries details: [size=2048 bytes] [mapped with DMA_BIDIRECTIONAL] [mapped as coherent]
WARNING: CPU: 0 PID: 2547 at kernel/dma/debug.c:853 dma_debug_device_change+0x174/0x1c8
(...)
Call trace:
dma_debug_device_change+0x174/0x1c8
blocking_notifier_call_chain+0x74/0xa8
device_release_driver_internal+0x18c/0x1f0
device_release_driver+0x20/0x30
pci_stop_bus_device+0x8c/0xe8
pci_stop_and_remove_bus_device+0x20/0x38
pci_iov_remove_virtfn+0xb8/0x128
sriov_disable+0x3c/0x110
pci_disable_sriov+0x24/0x30
enetc_sriov_configure+0x4c/0x108
sriov_numvfs_store+0x11c/0x198
(...)
DMA-API: Mapped at:
dma_entry_alloc+0xa4/0x130
debug_dma_alloc_coherent+0xbc/0x138
dma_alloc_attrs+0xa4/0x108
enetc_setup_cbdr+0x4c/0x1d0
enetc_vf_probe+0x11c/0x250
pci 0000:00:01.0: Removing from iommu group 19
This happens because stupid me moved enetc_teardown_cbdr outside of
enetc_free_si_resources, but did not bother to keep calling
enetc_teardown_cbdr from all the places where enetc_free_si_resources
was called. In particular, now it is no longer called from the main
unbind function, just from the probe error path.
Fixes: 4b47c0b81f ("net: enetc: don't initialize unused ports from a separate code path")
Reported-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
A follow-up patch will allow users to configures packet-per-second policing
in the software datapath. In preparation for this, teach all drivers that
support offload of the policer action to reject such configuration as
currently none of them support it.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit fd5736bf9f ("enetc: Workaround for MDIO register access
issue"), enetc_refill_rx_ring no longer updates the RX BD ring's
consumer index, that is left to be done by the caller. This has led to
bugs such as the ones found in 96a5223b91 ("net: enetc: remove bogus
write to SIRXIDR from enetc_setup_rxbdr") and 3a5d12c9be ("net: enetc:
keep RX ring consumer index in sync with hardware"), so it is desirable
that we move back the update of the consumer index into enetc_refill_rx_ring.
The trouble with that is the different MDIO locking context for the two
callers of enetc_refill_rx_ring:
- enetc_clean_rx_ring runs under enetc_lock_mdio()
- enetc_setup_rxbdr runs outside enetc_lock_mdio()
Simplify the callers of enetc_refill_rx_ring by making enetc_setup_rxbdr
explicitly take enetc_lock_mdio() around the call. It will be the only
place in need of ensuring the hot accessors can be used.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no other reason why this forward declaration exists rather than
poor ordering of the functions.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the NAPI enetc_poll after enetc_clean_rx_ring such that
we can delete the forward declarations.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The active_offloads variable of enetc_ndev_priv has an enum type, use it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we iterate through the BDs in the RX ring, the software producer
index (which is already passed by value to enetc_rxbd_next) lags behind,
and we end up with this funny looking "++i == rx_ring->bd_count" check
so that we drag it after us.
Let's pass the software producer index "i" by reference, so that
enetc_rxbd_next can increment it by itself (mod rx_ring->bd_count),
especially since enetc_rxbd_next has to increment the index anyway.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 3222b5b613 ("net: enetc: initialize RFS/RSS memories for
unused ports too") there is a requirement to initialize the memories of
unused PFs too, which has left the probe path in a bit of a rough shape,
because we basically have a minimal initialization path for unused PFs
which is separate from the main initialization path.
Now that initializing a control BD ring is as simple as calling
enetc_setup_cbdr, let's move that outside of enetc_alloc_si_resources
(unused PFs don't need classification rules, so no point in allocating
them just to free them later).
But enetc_alloc_si_resources is called both for PFs and for VFs, so now
that enetc_setup_cbdr is no longer called from this common function, it
means that the VF probe path needs to explicitly call enetc_setup_cbdr
too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes no sense from an API perspective to first initialize some
portion of struct enetc_cbdr outside enetc_setup_cbdr, then leave that
function to initialize the rest. enetc_setup_cbdr should be able to
perform all initialization given a zero-initialized struct enetc_cbdr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All call sites call enetc_clear_cbdr and enetc_free_cbdr one after
another, so let's combine the two functions into a single method named
enetc_teardown_cbdr which does both, and in the same order.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
enetc_clear_cbdr depends on struct enetc_hw because it must disable the
ring through a register write. We'd like to remove that dependency, so
let's do what's already done with the producer and consumer indices,
which is to save the iomem address in a variable kept in struct enetc_cbdr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
enetc_alloc_cbdr and enetc_setup_cbdr are always called one after
another, so we can simplify the callers and make enetc_setup_cbdr do
everything that's needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We shouldn't need to pass the struct device *dev to enetc CBDR APIs over
and over again, so save this inside struct enetc_cbdr::dma_dev and avoid
calling it from the enetc_free_cbdr functions.
This breaks the dependency of the cbdr API from struct enetc_si (the
station interface).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there is a dedicated file in this driver for interacting with
control BD rings, it makes sense to move these functions there.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in commit 29d98f54a4 ("net: enetc: allow hardware
timestamping on TX queues with tc-etf enabled"), hardware TX
timestamping requires an skb with skb->tstamp = 0. When a packet is sent
with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp,
so the drivers need to explicitly reset skb->tstamp to zero after
consuming the TX time.
Create a helper named skb_txtime_consumed() which does just that. All
drivers which offload TC_SETUP_QDISC_ETF should implement it, and it
would make it easier to assess during review whether they do the right
thing in order to be compatible with hardware timestamping or not.
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The txtime is passed to the driver in skb->skb_mstamp_ns, which is
actually in a union with skb->tstamp (the place where software
timestamps are kept).
Since commit b50a5c70ff ("net: allow simultaneous SW and HW transmit
timestamping"), __sock_recv_timestamp has some logic for making sure
that the two calls to skb_tstamp_tx:
skb_tx_timestamp(skb) # Software timestamp in the driver
-> skb_tstamp_tx(skb, NULL)
and
skb_tstamp_tx(skb, &shhwtstamps) # Hardware timestamp in the driver
will both do the right thing and in a race-free manner, meaning that
skb_tx_timestamp will deliver a cmsg with the software timestamp only,
and skb_tstamp_tx with a non-NULL hwtstamps argument will deliver a cmsg
with the hardware timestamp only.
Why are races even possible? Well, because although the software timestamp
skb->tstamp is private per skb, the hardware timestamp skb_hwtstamps(skb)
lives in skb_shinfo(skb), an area which is shared between skbs and their
clones. And skb_tstamp_tx works by cloning the packets when timestamping
them, therefore attempting to perform hardware timestamping on an skb's
clone will also change the hardware timestamp of the original skb. And
the original skb might have been yet again cloned for software
timestamping, at an earlier stage.
So the logic in __sock_recv_timestamp can't be as simple as saying
"does this skb have a hardware timestamp? if yes I'll send the hardware
timestamp to the socket, otherwise I'll send the software timestamp",
precisely because the hardware timestamp is shared.
Instead, it's quite the other way around: __sock_recv_timestamp says
"does this skb have a software timestamp? if yes, I'll send the software
timestamp, otherwise the hardware one". This works because the software
timestamp is not shared with clones.
But that means we have a problem when we attempt hardware timestamping
with skbs that don't have the skb->tstamp == 0. __sock_recv_timestamp
will say "oh, yeah, this must be some sort of odd clone" and will not
deliver the hardware timestamp to the socket. And this is exactly what
is happening when we have txtime enabled on the socket: as mentioned,
that is put in a union with skb->tstamp, so it is quite easy to mistake
it.
Do what other drivers do (intel igb/igc) and write zero to skb->tstamp
before taking the hardware timestamp. It's of no use to us now (we're
already on the TX confirmation path).
Fixes: 0d08c9ec7d ("enetc: add support time specific departure base on the qos etf")
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On LS1028A, the MAC RX FIFO defaults to the value 2, which is too high
and may lead to RX lock-up under traffic at a rate higher than 6 Gbps.
Set it to 1 instead, as recommended by the hardware design team and by
later versions of the ENETC block guide.
Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Jason Liu <jason.hui.liu@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RX rings have a producer index owned by hardware, where newly
received frame buffers are placed, and a consumer index owned by
software, where newly allocated buffers are placed, in expectation of
hardware being able to place frame data in them.
Hardware increments the producer index when a frame is received, however
it is not allowed to increment the producer index to match the consumer
index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
BDs. Whenever the producer index matches the value of the consumer
index, the ring has no unprocessed received frames and all BDs in the
ring have been initialized/prepared by software, i.e. hardware owns all
BDs in the ring.
The code uses the next_to_clean variable to keep track of the producer
index, and the next_to_use variable to keep track of the consumer index.
The RX rings are seeded from enetc_refill_rx_ring, which is called from
two places:
1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:
.ndo_open
-> enetc_open
-> enetc_setup_bdrs
-> enetc_setup_rxbdr
-> enetc_refill_rx_ring
2. then during the data path processing, it is refilled with 16 buffers
at a time:
enetc_msix
-> napi_schedule
-> enetc_poll
-> enetc_clean_rx_ring
-> enetc_refill_rx_ring
There is just one problem: the initial seeding done during .ndo_open
updates just the producer index (ENETC_RBPIR) with 0, and the software
next_to_clean and next_to_use variables. Notably, it will not update the
consumer index to make the hardware aware of the newly added buffers.
Wait, what? So how does it work?
Well, the reset values of the producer index and of the consumer index
of a ring are both zero. As per the description in the second paragraph,
it means that the ring is full of buffers waiting for hardware to put
frames in them, which by coincidence is almost true, because we have in
fact seeded 511 buffers into the ring.
But will the hardware attempt to access the 512th entry of the ring,
which has an invalid BD in it? Well, no, because in order to do that, it
would have to first populate the first 511 entries, and the NAPI
enetc_poll will kick in by then. Eventually, after 16 processed slots
have become available in the RX ring, enetc_clean_rx_ring will call
enetc_refill_rx_ring and then will [ finally ] update the consumer index
with the new software next_to_use variable. From now on, the
next_to_clean and next_to_use variables are in sync with the producer
and consumer ring indices.
So the day is saved, right? Well, not quite. Freeing the memory
allocated for the rings is done in:
enetc_close
-> enetc_clear_bdrs
-> enetc_clear_rxbdr
-> this just disables the ring
-> enetc_free_rxtx_rings
-> enetc_free_rx_ring
-> sets next_to_clean and next_to_use to 0
but again, nothing is committed to the hardware producer and consumer
indices (yay!). The assumption is that the ring is disabled, so the
indices don't matter anyway, and it's the responsibility of the "open"
code path to set those up.
.. Except that the "open" code path does not set those up properly.
While initially, things almost work, during subsequent enetc_close ->
enetc_open sequences, we have problems. To be precise, the enetc_open
that is subsequent to enetc_close will again refill the ring with 511
entries, but it will leave the consumer index untouched. Untouched
means, of course, equal to the value it had before disabling the ring
and draining the old buffers in enetc_close.
But as mentioned, enetc_setup_rxbdr will at least update the producer
index though, through this line of code:
enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);
so at this stage we'll have:
next_to_clean=0 (in hardware 0)
next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)
Again, the next_to_clean and producer index are in sync and set to
correct values, so the driver manages to limp on. Eventually, 16 ring
entries will be consumed by enetc_poll, and the savior
enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
update the hardware consumer ring based upon the new next_to_use.
So.. it works?
Well, by coincidence, it almost does, but there's a circumstance where
enetc_clean_rx_ring won't be there to save us. If the previous value of
the consumer index was 15, there's a problem, because the NAPI poll
sequence will only issue a refill when 16 or more buffers have been
consumed.
It's easiest to illustrate this with an example:
ip link set eno0 up
ip addr add 192.168.100.1/24 dev eno0
ping 192.168.100.1 -c 20 # ping this port from another board
ip link set eno0 down
ip link set eno0 up
ping 192.168.100.1 -c 20 # ping it again from the same other board
One by one:
1. ip link set eno0 up
-> calls enetc_setup_rxbdr:
-> calls enetc_refill_rx_ring(511 buffers)
-> next_to_clean=0 (in hw 0)
-> next_to_use=511 (in hw 0)
2. ping 192.168.100.1 -c 20 # ping this port from another board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)
20 packets transmitted, 20 packets received, 0% packet loss
3. ip link set eno0 down
enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)
4. ip link set eno0 up
-> calls enetc_setup_rxbdr:
-> calls enetc_refill_rx_ring(511 buffers)
-> next_to_clean=0 (in hw 0)
-> next_to_use=511 (in hw 15)
5. ping 192.168.100.1 -c 20 # ping it again from the same other board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)
20 packets transmitted, 12 packets received, 40% packet loss
And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
to 15 for that to happen), no nothing. The hardware enters the condition where
the producer (14) + 1 is equal to the consumer (15) index, which makes it
believe it has no more free buffers to put packets in, so it starts discarding
them:
ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
NIC statistics:
Rx ring 0 discarded frames: 8
Summarized, if the interface receives between 16 and 32 (mod 512) frames
and then there is a link flap, then the port will eventually die with no
way to recover. If it receives less than 16 (mod 512) frames, then the
initial NAPI poll [ before the link flap ] will not update the consumer
index in hardware (it will remain zero) which will be ok when the buffers
are later reinitialized. If more than 32 (mod 512) frames are received,
the initial NAPI poll has the chance to refill the ring twice, updating
the consumer index to at least 32. So after the link flap, the consumer
index is still wrong, but the post-flap NAPI poll gets a chance to
refill the ring once (because it passes through cleaned_cnt=15) and
makes the consumer index be again back in sync with next_to_use.
The solution to this problem is actually simple, we just need to write
next_to_use into the hardware consumer index at enetc_open time, which
always brings it back in sync after an initial buffer seeding process.
The simpler thing would be to put the write to the consumer index into
enetc_refill_rx_ring directly, but there are issues with the MDIO
locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
top-level and we use the unlocked enetc_wr_reg_hot, whereas in
enetc_open, the enetc_lock_mdio() is not taken at the top level, but
instead by each individual enetc_wr_reg, so we are forced to put an
additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
the code is left as a refactoring exercise.
Fixes: d4fd0404c1 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Station Interface Receive Interrupt Detect Register (SIRXIDR)
contains a 16-bit wide mask of 'interrupt detected' events for each ring
associated with a port. Bit i is write-1-to-clean for RX ring i.
I have no explanation whatsoever how this line of code came to be
inserted in the blamed commit. I checked the downstream versions of that
patch and none of them have it.
The somewhat comical aspect of it is that we're writing a binary number
to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring).
Since the RX rings have 512 buffer descriptors, we end up writing 511 to
this register, which is 0x1ff, so we are effectively clearing the
'interrupt detected' event for rings 0-8.
This register is not what is used for interrupt handling though - it
only provides a summary for the entire SI. The hardware provides one
separate Interrupt Detect Register per RX ring, which auto-clears upon
read. So there doesn't seem to be any adverse effect caused by this
bogus write.
There is, however, one reason why this should be handled as a bugfix:
next_to_clean _should_ be committed to hardware, just not to that
register, and this was obscuring the fact that it wasn't. This is fixed
in the next patch, and removing the bogus line now allows the fix patch
to be backported beyond that point.
Fixes: fd5736bf9f ("enetc: Workaround for MDIO register access issue")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ENETC port 0 MAC supports in-band status signaling coming from a PHY
when operating in RGMII mode, and this feature is enabled by default.
It has been reported that RGMII is broken in fixed-link, and that is not
surprising considering the fact that no PHY is attached to the MAC in
that case, but a switch.
This brings us to the topic of the patch: the enetc driver should have
not enabled the optional in-band status signaling for RGMII unconditionally,
but should have forced the speed and duplex to what was resolved by
phylink.
Note that phylink does not accept the RGMII modes as valid for in-band
signaling, and these operate a bit differently than 1000base-x and SGMII
(notably there is no clause 37 state machine so no ACK required from the
MAC, instead the PHY sends extra code words on RXD[3:0] whenever it is
not transmitting something else, so it should be safe to leave a PHY
with this option unconditionally enabled even if we ignore it). The spec
talks about this here:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/138/RGMIIv1_5F00_3.pdf
Fixes: 71b77a7a27 ("enetc: Migrate to PHYLINK and PCS_LYNX")
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting from the blamed commit:
In promiscuous mode, it is more intuitive that all traffic is received,
including VLAN tagged traffic. It appears that it is necessary to set
the flag in PSIPVMR for that to be the case, so VLAN promiscuous mode is
also temporarily enabled. On exit from promiscuous mode, the setting
made by ethtool is restored.
Intuitive or not, there isn't any definition issued by a standards body
which says that promiscuity has anything to do with VLAN filtering - it
only has to do with accepting packets regardless of destination MAC address.
In fact people are already trying to use this misunderstanding/bug of
the enetc driver as a justification to transform promiscuity into
something it never was about: accepting every packet (maybe that would
be the "rx-all" netdev feature?):
https://lore.kernel.org/netdev/20201110153958.ci5ekor3o2ekg3ky@ipetronik.com/
This is relevant because there are use cases in the kernel (such as
tc-flower rules with the protocol 802.1Q and a vlan_id key) which do not
(yet) use the vlan_vid_add API to be compatible with VLAN-filtering NICs
such as enetc, so for those, disabling rx-vlan-filter is currently the
only right solution to make these setups work:
https://lore.kernel.org/netdev/CA+h21hoxwRdhq4y+w8Kwgm74d4cA0xLeiHTrmT-VpSaM7obhkg@mail.gmail.com/
The blamed patch has unintentionally introduced one more way for this to
work, which is to enable IFF_PROMISC, however this is non-portable
because port promiscuity is not meant to disable VLAN filtering.
Therefore, it could invite people to write broken scripts for enetc, and
then wonder why they are broken when migrating to other drivers that
don't handle promiscuity in the same way.
Fixes: 7070eea5e9 ("enetc: permit configuration of rx-vlan-filter with ethtool")
Cc: Markus Blöchl <Markus.Bloechl@ipetronik.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>