linux/drivers/net/ethernet
Vladimir Oltean 3a5d12c9be net: enetc: keep RX ring consumer index in sync with hardware
The RX rings have a producer index owned by hardware, where newly
received frame buffers are placed, and a consumer index owned by
software, where newly allocated buffers are placed, in expectation of
hardware being able to place frame data in them.

Hardware increments the producer index when a frame is received, however
it is not allowed to increment the producer index to match the consumer
index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
BDs. Whenever the producer index matches the value of the consumer
index, the ring has no unprocessed received frames and all BDs in the
ring have been initialized/prepared by software, i.e. hardware owns all
BDs in the ring.

The code uses the next_to_clean variable to keep track of the producer
index, and the next_to_use variable to keep track of the consumer index.

The RX rings are seeded from enetc_refill_rx_ring, which is called from
two places:

1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
   i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:

.ndo_open
-> enetc_open
   -> enetc_setup_bdrs
      -> enetc_setup_rxbdr
         -> enetc_refill_rx_ring

2. then during the data path processing, it is refilled with 16 buffers
   at a time:

enetc_msix
-> napi_schedule
   -> enetc_poll
      -> enetc_clean_rx_ring
         -> enetc_refill_rx_ring

There is just one problem: the initial seeding done during .ndo_open
updates just the producer index (ENETC_RBPIR) with 0, and the software
next_to_clean and next_to_use variables. Notably, it will not update the
consumer index to make the hardware aware of the newly added buffers.

Wait, what? So how does it work?

Well, the reset values of the producer index and of the consumer index
of a ring are both zero. As per the description in the second paragraph,
it means that the ring is full of buffers waiting for hardware to put
frames in them, which by coincidence is almost true, because we have in
fact seeded 511 buffers into the ring.

But will the hardware attempt to access the 512th entry of the ring,
which has an invalid BD in it? Well, no, because in order to do that, it
would have to first populate the first 511 entries, and the NAPI
enetc_poll will kick in by then. Eventually, after 16 processed slots
have become available in the RX ring, enetc_clean_rx_ring will call
enetc_refill_rx_ring and then will [ finally ] update the consumer index
with the new software next_to_use variable. From now on, the
next_to_clean and next_to_use variables are in sync with the producer
and consumer ring indices.

So the day is saved, right? Well, not quite. Freeing the memory
allocated for the rings is done in:

enetc_close
-> enetc_clear_bdrs
   -> enetc_clear_rxbdr
      -> this just disables the ring
-> enetc_free_rxtx_rings
   -> enetc_free_rx_ring
      -> sets next_to_clean and next_to_use to 0

but again, nothing is committed to the hardware producer and consumer
indices (yay!). The assumption is that the ring is disabled, so the
indices don't matter anyway, and it's the responsibility of the "open"
code path to set those up.

.. Except that the "open" code path does not set those up properly.

While initially, things almost work, during subsequent enetc_close ->
enetc_open sequences, we have problems. To be precise, the enetc_open
that is subsequent to enetc_close will again refill the ring with 511
entries, but it will leave the consumer index untouched. Untouched
means, of course, equal to the value it had before disabling the ring
and draining the old buffers in enetc_close.

But as mentioned, enetc_setup_rxbdr will at least update the producer
index though, through this line of code:

	enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);

so at this stage we'll have:

next_to_clean=0 (in hardware 0)
next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)

Again, the next_to_clean and producer index are in sync and set to
correct values, so the driver manages to limp on. Eventually, 16 ring
entries will be consumed by enetc_poll, and the savior
enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
update the hardware consumer ring based upon the new next_to_use.

So.. it works?
Well, by coincidence, it almost does, but there's a circumstance where
enetc_clean_rx_ring won't be there to save us. If the previous value of
the consumer index was 15, there's a problem, because the NAPI poll
sequence will only issue a refill when 16 or more buffers have been
consumed.

It's easiest to illustrate this with an example:

ip link set eno0 up
ip addr add 192.168.100.1/24 dev eno0
ping 192.168.100.1 -c 20 # ping this port from another board
ip link set eno0 down
ip link set eno0 up
ping 192.168.100.1 -c 20 # ping it again from the same other board

One by one:

1. ip link set eno0 up
-> calls enetc_setup_rxbdr:
   -> calls enetc_refill_rx_ring(511 buffers)
   -> next_to_clean=0 (in hw 0)
   -> next_to_use=511 (in hw 0)

2. ping 192.168.100.1 -c 20 # ping this port from another board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)

20 packets transmitted, 20 packets received, 0% packet loss

3. ip link set eno0 down
enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)

4. ip link set eno0 up
-> calls enetc_setup_rxbdr:
   -> calls enetc_refill_rx_ring(511 buffers)
   -> next_to_clean=0 (in hw 0)
   -> next_to_use=511 (in hw 15)

5. ping 192.168.100.1 -c 20 # ping it again from the same other board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)

20 packets transmitted, 12 packets received, 40% packet loss

And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
to 15 for that to happen), no nothing. The hardware enters the condition where
the producer (14) + 1 is equal to the consumer (15) index, which makes it
believe it has no more free buffers to put packets in, so it starts discarding
them:

ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
NIC statistics:
     Rx ring  0 discarded frames: 8

Summarized, if the interface receives between 16 and 32 (mod 512) frames
and then there is a link flap, then the port will eventually die with no
way to recover. If it receives less than 16 (mod 512) frames, then the
initial NAPI poll [ before the link flap ] will not update the consumer
index in hardware (it will remain zero) which will be ok when the buffers
are later reinitialized. If more than 32 (mod 512) frames are received,
the initial NAPI poll has the chance to refill the ring twice, updating
the consumer index to at least 32. So after the link flap, the consumer
index is still wrong, but the post-flap NAPI poll gets a chance to
refill the ring once (because it passes through cleaned_cnt=15) and
makes the consumer index be again back in sync with next_to_use.

The solution to this problem is actually simple, we just need to write
next_to_use into the hardware consumer index at enetc_open time, which
always brings it back in sync after an initial buffer seeding process.

The simpler thing would be to put the write to the consumer index into
enetc_refill_rx_ring directly, but there are issues with the MDIO
locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
top-level and we use the unlocked enetc_wr_reg_hot, whereas in
enetc_open, the enetc_lock_mdio() is not taken at the top level, but
instead by each individual enetc_wr_reg, so we are forced to put an
additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
the code is left as a refactoring exercise.

Fixes: d4fd0404c1 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
..
3com isa: Make the remove callback for isa drivers return void 2021-01-26 07:42:27 +01:00
8390
adaptec
aeroflex
agere ethernet: select CONFIG_CRC32 as needed 2020-12-04 14:42:21 -08:00
alacritech
allwinner net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function 2020-12-16 11:37:31 -08:00
alteon
altera
amazon Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-10 13:30:12 -08:00
amd Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
apm
apple
aquantia net: ethernet: aquantia: Handle error cleanup of start on open 2021-02-11 14:38:06 -08:00
arc
atheros net: ag71xx: remove unnecessary MTU reservation 2021-02-22 17:45:25 -08:00
broadcom net: broadcom: bcm4908_enet: enable RX after processing packets 2021-02-28 11:51:31 -08:00
brocade net: bna: remove trailing semicolon in macro definition 2020-12-04 17:41:49 -08:00
cadence net: macb: ignore tx_clk if MII is used 2021-01-21 19:50:47 -08:00
calxeda
cavium net/ethernet: convert to use module_platform_driver in octeon_mgmt.c 2021-01-29 19:02:43 -08:00
chelsio Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2021-02-21 17:23:56 -08:00
cirrus
cisco net: remove ndo_udp_tunnel_* callbacks 2021-01-07 12:53:29 -08:00
cortina
davicom
dec
dlink
emulex net: remove ndo_udp_tunnel_* callbacks 2021-01-07 12:53:29 -08:00
ezchip
faraday Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-12-11 22:29:38 -08:00
freescale net: enetc: keep RX ring consumer index in sync with hardware 2021-03-01 13:34:47 -08:00
fujitsu
google gve: Add support for raw addressing in the tx path 2020-12-08 16:06:28 -08:00
hisilicon net: hns3: fix bug when calculating the TCAM table info 2021-02-28 12:04:00 -08:00
huawei net: hinic: simplify the return hinic_configure_max_qnum() 2020-12-09 17:05:37 -08:00
i825xx drivers: net: ethernet: i825xx: Fix couple of spellings in the file ether1.c 2021-02-04 19:07:55 -08:00
ibm Networking fixes for 5.12-rc1. Rather small batch this time. 2021-02-25 12:06:25 -08:00
intel Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 2021-02-23 16:45:30 -08:00
marvell Marvell Sky2 Ethernet adapter: fix warning messages. 2021-02-23 12:17:21 -08:00
mediatek net: ethernet: mediatek: support setting MTU 2021-01-25 18:23:14 -08:00
mellanox mlxsw: spectrum_router: Ignore routes using a deleted nexthop object 2021-02-26 15:47:53 -08:00
micrel net: ks8851: remove definition of DEBUG 2021-01-15 17:53:29 -08:00
microchip lan743x: sync only the received area of an rx ring buffer 2021-02-16 15:15:21 -08:00
moxa
mscc net: mscc: ocelot: select NET_DEVLINK 2021-02-26 15:29:53 -08:00
myricom
natsemi net/sonic: Fix some resource leaks in error handling paths 2021-01-05 15:59:20 -08:00
neterion net: vxget: clean up sparse warnings 2020-12-14 19:18:11 -08:00
netronome bpf: Rename BPF_XADD and prepare to encode other atomics in .imm 2021-01-14 18:34:29 -08:00
ni net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments" 2020-12-17 10:54:31 -08:00
nvidia
nxp ethernet: select CONFIG_CRC32 as needed 2020-12-04 14:42:21 -08:00
oki-semi net: pch_gbe: Use 'dma_free_coherent()' to undo 'dma_alloc_coherent()' 2020-11-23 17:01:48 -08:00
packetengines
pasemi net: pasemi: fix error return code in pasemi_mac_open() 2020-12-02 18:03:58 -08:00
pensando ionic: Remove unused function pointer typedef ionic_reset_cb 2021-02-16 13:43:58 -08:00
qlogic qede: preserve per queue stats across up/down of interface 2021-02-11 14:25:06 -08:00
qualcomm net: qualcomm: rmnet: Fix rx_handler for non-linear skbs 2021-02-06 11:28:45 -08:00
rdc
realtek r8169: fix jumbo packet handling on RTL8168e 2021-02-25 09:55:16 -08:00
renesas sh_eth: fix TRSCER mask for R7S9210 2021-03-01 13:22:34 -08:00
rocker net: switchdev: pass flags and mask to both {PRE_,}BRIDGE_FLAGS attributes 2021-02-12 17:08:04 -08:00
samsung net: phy: rename PHY_IGNORE_INTERRUPT to PHY_MAC_INTERRUPT 2021-02-15 15:20:49 -08:00
seeq
sfc sfc: reduce the number of requested xdp ev queues 2021-01-22 19:16:45 -08:00
sgi
silan
sis
smsc net: smsc911x: Make Runtime PM handling more fine-grained 2021-01-19 17:46:39 -08:00
socionext net, xdp: Introduce xdp_prepare_buff utility routine 2021-01-08 13:39:24 -08:00
stmicro net: stmmac: re-init rx buffers when mac resume back 2021-02-26 15:17:12 -08:00
sun
synopsys net: dwc-xlgmac: Fix spelling mistake in function name 2021-02-06 11:37:01 -08:00
tehuti
ti Devicetree updates for v5.12: 2021-02-22 10:05:12 -08:00
toshiba net: ethernet: toshiba: spider_net: Document a whole bunch of function parameters 2021-01-16 18:13:28 -08:00
tundra
via
wiznet
xilinx Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
xircom
xscale net: ixp4xx_eth: Use DEFINE_SPINLOCK() for spinlock 2021-01-05 15:43:41 -08:00
dnet.c
dnet.h
ec_bhf.c
ethoc.c net: ethernet: Fix memleak in ethoc_probe 2020-12-23 12:28:53 -08:00
fealnx.c
jme.c
jme.h
Kconfig net: remove aurora nb8800 driver 2021-01-21 19:48:38 -08:00
korina.c net: korina: fix return value 2020-12-16 15:01:07 -08:00
lantiq_etop.c
lantiq_xrx200.c
Makefile net: remove aurora nb8800 driver 2021-01-21 19:48:38 -08:00