Commit Graph

998606 Commits

Author SHA1 Message Date
Vladimir Oltean
9d2b68cc10 net: enetc: add support for XDP_REDIRECT
The driver implementation of the XDP_REDIRECT action reuses parts from
XDP_TX, most notably the enetc_xdp_tx function which transmits an array
of TX software BDs. Only this time, the buffers don't have DMA mappings,
we need to create them.

When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can
employ the same buffer reuse strategy as for the normal processing path
and for XDP_PASS: we can flip to the other page half and seed that to
the RX ring.

Note that scatter/gather support is there, but disabled due to lack of
multi-buffer support in XDP (which is added by this series):
https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
d6a2829e82 net: enetc: increase RX ring default size
As explained in the XDP_TX patch, when receiving a burst of frames with
the XDP_TX verdict, there is a momentary dip in the number of available
RX buffers. The system will eventually recover as TX completions will
start kicking in and refilling our RX BD ring again. But until that
happens, we need to survive with as few out-of-buffer discards as
possible.

This increases the memory footprint of the driver in order to avoid
discards at 2.5Gbps line rate 64B packet sizes, the maximum speed
available for testing on 1 port on NXP LS1028A.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
7ed2bc8007 net: enetc: add support for XDP_TX
For reflecting packets back into the interface they came from, we create
an array of TX software BDs derived from the RX software BDs. Therefore,
we need to extend the TX software BD structure to contain most of the
stuff that's already present in the RX software BD structure, for
reasons that will become evident in a moment.

For a frame with the XDP_TX verdict, we don't reuse any buffer right
away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
page half, same as the skb code path).

Because the buffer transfers ownership from the RX ring to the TX ring,
reusing any page half right away is very dangerous. So what we can do is
we can recycle the same page half as soon as TX is complete.

The code path is:
enetc_poll
-> enetc_clean_rx_ring_xdp
   -> enetc_xdp_tx
   -> enetc_refill_rx_ring
(time passes, another MSI interrupt is raised)
enetc_poll
-> enetc_clean_tx_ring
   -> enetc_recycle_xdp_tx_buff

But that creates a problem, because there is a potentially large time
window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, period in
which we'll have less and less RX buffers.

Basically, when the ship starts sinking, the knee-jerk reaction is to
let enetc_refill_rx_ring do what it does for the standard skb code path
(refill every 16 consumed buffers), but that turns out to be very
inefficient. The problem is that we have no rx_swbd->page at our
disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
have to call enetc_new_page for every buffer that we refill (if we
choose to refill at this early stage). Very inefficient, it only makes
the problem worse, because page allocation is an expensive process, and
CPU time is exactly what we're lacking.

Additionally, there is an even bigger problem: if we let
enetc_refill_rx_ring top up the ring's buffers again from the RX path,
remember that the buffers sent to transmission haven't disappeared
anywhere. They will be eventually sent, and processed in
enetc_clean_tx_ring, and an attempt will be made to recycle them.
But surprise, the RX ring is already full of new buffers, because we
were premature in deciding that we should refill. So not only we took
the expensive decision of allocating new pages, but now we must throw
away perfectly good and reusable buffers.

So what we do is we implement an elastic refill mechanism, which keeps
track of the number of in-flight XDP_TX buffer descriptors. We top up
the RX ring only up to the total ring capacity minus the number of BDs
that are in flight (because we know that those BDs will return to us
eventually).

The enetc driver manages 1 RX ring per CPU, and the default TX ring
management is the same. So we do XDP_TX towards the TX ring of the same
index, because it is affined to the same CPU. This will probably not
produce great results when we have a tc-taprio/tc-mqprio qdisc on the
interface, because in that case, the number of TX rings might be
greater, but I didn't add any checks for that yet (mostly because I
didn't know what checks to add).

It should also be noted that we need to change the DMA mapping direction
for RX buffers, since they may now be reflected into the TX ring of the
same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
remapping as DMA_TO_DEVICE, because performance is better this way.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
d1b15102dd net: enetc: add support for XDP_DROP and XDP_PASS
For the RX ring, enetc uses an allocation scheme based on pages split
into two buffers, which is already very efficient in terms of preventing
reallocations / maximizing reuse, so I see no reason why I would change
that.

 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half B | half B | half B | half B | half B | half B | half B |
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half A | half A | half A | half A | half A | half A | half A | RX ring
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
     ^                                                     ^
     |                                                     |
 next_to_clean                                       next_to_alloc
                                                      next_to_use

                   +--------+--------+--------+--------+--------+
                   |        |        |        |        |        |
                   | half B | half B | half B | half B | half B |
                   |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half B | half B | half A | half A | half A | half A | half A | RX ring
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |   ^                                   ^
 | half A | half A |   |                                   |
 |        |        | next_to_clean                   next_to_use
 +--------+--------+
              ^
              |
         next_to_alloc

then when enetc_refill_rx_ring is called, whose purpose is to advance
next_to_use, it sees that it can take buffers up to next_to_alloc, and
it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
one!".

The only problem is that for default PAGE_SIZE values of 4096, buffer
sizes are 2048 bytes. While this is enough for normal skb allocations at
an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
bytes, and including skb_shared_info and alignment, we end up being able
to make use of only 1472 bytes, which is insufficient for the default
MTU.

To solve that problem, we implement scatter/gather processing in the
driver, because we would really like to keep the existing allocation
scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
another one of 28 bytes.

Because the headroom required by XDP is different (and much larger) than
the one required by the network stack, whenever a BPF program is added
or deleted on the port, we drain the existing RX buffers and seed new
ones with the required headroom. We also keep the required headroom in
rx_ring->buffer_offset.

The simplest way to implement XDP_PASS, where an skb must be created, is
to create an xdp_buff based on the next_to_clean RX BDs, but not clear
those BDs from the RX ring yet, just keep the original index at which
the BDs for this frame started. Then, if the verdict is XDP_PASS,
instead of converting the xdb_buff to an skb, we replay a call to
enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
starting from the original BD index.

We would also like to be minimally invasive to the regular RX data path,
and not check whether there is a BPF program attached to the ring on
every packet. So we create a separate RX ring processing function for
XDP.

Because we only install/remove the BPF program while the interface is
down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
shouldn't be any circumstance in which we are processing packets and
there is a potentially freed BPF program attached to the RX ring.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
65d0cbb414 net: enetc: move up enetc_reuse_page and enetc_page_reusable
For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
so we need to avoid a forward declaration.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
1ee8d6f3be net: enetc: clean the TX software BD on the TX confirmation path
With the future introduction of some new fields into enetc_tx_swbd such
as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
the old code paths.

This is because TX software buffer descriptors are kept in a ring that
is shadow of the hardware TX ring, so these structures keep getting
reused, and there is always the possibility that when a software BD is
reused (after we ran a full circle through the TX ring), the old user of
the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
skb, which would need to set is_xdp_tx = false.

To be minimally invasive to the old code paths, let's just scrub the
software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
we know that nobody uses this software TX BD (tx_ring->next_to_clean
hasn't yet been updated, and the TX paths check enetc_bd_unused which
tells them if there's any more space in the TX ring for a new enqueue).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
d504498d2e net: enetc: add a dedicated is_eof bit in the TX software BD
In the transmit path, if we have a scatter/gather frame, it is put into
multiple software buffer descriptors, the last of which has the skb
pointer populated (which is necessary for rearming the TX MSI vector and
for collecting the two-step TX timestamp from the TX confirmation path).

At the moment, this is sufficient, but with XDP_TX, we'll need to
service TX software buffer descriptors that don't have an skb pointer,
however they might be final nonetheless. So add a dedicated bit for
final software BDs that we populate and check explicitly. Also, we keep
looking just for an skb when doing TX timestamping, because we don't
want/need that for XDP.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
a800abd3ec net: enetc: move skb creation into enetc_build_skb
We need to build an skb from two code paths now: from the plain RX data
path and from the XDP data path when the verdict is XDP_PASS.

Create a new enetc_build_skb function which contains the essential steps
for building an skb based on the first and last positions of buffer
descriptors within the RX ring.

We also squash the enetc_process_skb function into enetc_build_skb,
because what that function did wasn't very meaningful on its own.

The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
itself must be before napi_gro_receive, because that's when we lose
ownership of the skb.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:44 -07:00
Vladimir Oltean
2fa423f5f0 net: enetc: consume the error RX buffer descriptors in a dedicated function
We can and should check the RX BD errors before starting to build the
skb. The only apparent reason why things are done in this backwards
order is to spare one call to enetc_rxbd_next.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:57:43 -07:00
Eric Dumazet
0d7a7b2014 ipv6: remove extra dev_hold() for fallback tunnels
My previous commits added a dev_hold() in tunnels ndo_init(),
but forgot to remove it from special functions setting up fallback tunnels.

Fallback tunnels do call their respective ndo_init()

This leads to various reports like :

unregister_netdevice: waiting for ip6gre0 to become free. Usage count = 2

Fixes: 48bb569726 ("ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 6289a98f08 ("sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 40cb881b5a ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 7f700334be ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:53:11 -07:00
Yang Yingliang
ac1db7acea net/tipc: fix missing destroy_workqueue() on error in tipc_crypto_start()
Add the missing destroy_workqueue() before return from
tipc_crypto_start() in the error handling case.

Fixes: 1ef6f7c939 ("tipc: add automatic session key exchange")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:52:00 -07:00
David S. Miller
ab1b4f0a83 Merge branch 'inet-shrink-netns'
Eric Dumazet says:

====================
inet: shrink netns_ipv{4|6}

This patch series work on reducing footprint of netns_ipv4
and netns_ipv6. Some sysctls are converted to bytes,
and some fields are moves to reduce number of holes
and paddings.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
0dd39d952f ipv6: move ip6_dst_ops first in netns_ipv6
ip6_dst_ops have cache line alignement.

Moving it at beginning of netns_ipv6
removes a 48 byte hole, and shrinks netns_ipv6
from 12 to 11 cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
a6175633a2 ipv6: convert elligible sysctls to u8
Convert most sysctls that can fit in a byte.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
1c3289c931 tcp: convert tcp_comp_sack_nr sysctl to u8
tcp_comp_sack_nr max value was already 255.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
7d4b37ebb9 ipv4: convert igmp_link_local_mcast_reports sysctl to u8
This sysctl is a bool, can use less storage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
be205fe6ec ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8
Make room for better packing of netns_ipv4

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
cd04bd0222 ipv4: convert udp_l3mdev_accept sysctl to u8
Reduce footprint of sysctls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
b2908fac5b ipv4: convert fib_notify_on_flag_change sysctl to u8
Reduce footprint of sysctls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
Eric Dumazet
490f33c4e7 inet: shrink netns_ipv4 by another cache line
By shuffling around some fields to remove 8 bytes of hole,
we can save one cache line.

pahole result before/after the patch :

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

->

/* size: 704, cachelines: 11, members: 139 */
/* sum members: 673, holes: 10, sum holes: 31 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
Eric Dumazet
1caf8d39c5 inet: shrink inet_timewait_death_row by 48 bytes
struct inet_timewait_death_row uses two cache lines, because we want
tw_count to use a full cache line to avoid false sharing.

Rework its definition and placement in netns_ipv4 so that:

1) We add 60 bytes of padding after tw_count to avoid
  false sharing, knowing that tcp_death_row will
  have ____cacheline_aligned_in_smp attribute.

2) We do not risk padding before tcp_death_row, because
  we move it at the beginning of netns_ipv4, even if new
 fields are added later.

3) We do not waste 48 bytes of padding after it.

Note that I have not changed dccp.

pahole result for struct netns_ipv4 before/after the patch :

/* size: 832, cachelines: 13, members: 139 */
/* sum members: 721, holes: 12, sum holes: 95 */
/* padding: 16 */
/* paddings: 2, sum paddings: 55 */

->

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
David S. Miller
30b8817f5f Merge branch 'net-coding-style'
Weihang Li says:

====================
net: fix some coding style issues

Do some cleanups according to the coding style of kernel, including wrong
print type, redundant and missing spaces and so on.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yangyang Li
44d043b53d net: lpc_eth: fix format warnings of block comments
Fix the following format warning:
1. Block comments use * on subsequent lines
2. Block comments use a trailing */ on a separate line

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yixing Liu
142c1d2ed9 net: toshiba: fix the trailing format of some block comments
Use a trailling */ on a separate line for block comments.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yixing Liu
1f78ff4ff7 net: ocelot: fix a trailling format issue with block comments
Use a tralling */ on a separate line for block comments.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yixing Liu
3f6ebcffaf net: amd: correct some format issues
There should be a blank line after declarations.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yixing Liu
ca3fc0aa08 net: amd8111e: fix inappropriate spaces
Delete unncecessary spaces and add some reasonable spaces according to the
coding-style of kernel.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:09 -07:00
Yixing Liu
e355fa6a3f net: ena: remove extra words from comments
Remove the redundant "for" from the commment.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:08 -07:00
Yixing Liu
b788ff0a7d net: ena: fix inaccurate print type
Use "%u" to replace "hu%".

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:34:08 -07:00
Matthew Wilcox (Oracle)
3cbf7530a1 qrtr: Convert qrtr_ports from IDR to XArray
The XArray interface is easier for this driver to use.  Also fixes a
bug reported by the improper use of GFP_ATOMIC.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:32:20 -07:00
Wan Jiabing
53f7c5e140 net: ethernet: stmicro: Remove duplicate struct declaration
struct stmmac_safety_stats is declared twice. One has been
declared at 29th line. Remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:31:14 -07:00
Eric Dumazet
48bb569726 ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods
Same reasons than for the previous commits :
6289a98f08 ("sit: proper dev_{hold|put} in ndo_[un]init methods")
40cb881b5a ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
7f700334be ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")

After adopting CONFIG_PCPU_DEV_REFCNT=n option, syzbot was able to trigger
a warning [1]

Issue here is that:

- all dev_put() should be paired with a corresponding prior dev_hold().

- A driver doing a dev_put() in its ndo_uninit() MUST also
  do a dev_hold() in its ndo_init(), only when ndo_init()
  is returning 0.

Otherwise, register_netdevice() would call ndo_uninit()
in its error path and release a refcount too soon.

[1]
WARNING: CPU: 1 PID: 21059 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Modules linked in:
CPU: 1 PID: 21059 Comm: syz-executor.4 Not tainted 5.12.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Code: 1d 6a 5a e8 09 31 ff 89 de e8 8d 1a ab fd 84 db 75 e0 e8 d4 13 ab fd 48 c7 c7 a0 e1 c1 89 c6 05 4a 5a e8 09 01 e8 2e 36 fb 04 <0f> 0b eb c4 e8 b8 13 ab fd 0f b6 1d 39 5a e8 09 31 ff 89 de e8 58
RSP: 0018:ffffc900025aefe8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000040000 RSI: ffffffff815c51f5 RDI: fffff520004b5def
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815bdf8e R11: 0000000000000000 R12: ffff888023488568
R13: ffff8880254e9000 R14: 00000000dfd82cfd R15: ffff88802ee2d7c0
FS:  00007f13bc590700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0943e74000 CR3: 0000000025273000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __refcount_dec include/linux/refcount.h:344 [inline]
 refcount_dec include/linux/refcount.h:359 [inline]
 dev_put include/linux/netdevice.h:4135 [inline]
 ip6_tnl_dev_uninit+0x370/0x3d0 net/ipv6/ip6_tunnel.c:387
 register_netdevice+0xadf/0x1500 net/core/dev.c:10308
 ip6_tnl_create2+0x1b5/0x400 net/ipv6/ip6_tunnel.c:263
 ip6_tnl_newlink+0x312/0x580 net/ipv6/ip6_tunnel.c:2052
 __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3443
 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3491
 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:674
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:16:43 -07:00
David S. Miller
e3f685aa73 Merge branch 'ethtool-fec-netlink'
Jakub Kicinski says:

====================
ethtool: support FEC configuration over netlink

This series adds support for the equivalents of ETHTOOL_GFECPARAM
and ETHTOOL_SFECPARAM over netlink.

As a reminder - this is an API which allows user to query current
FEC mode, as well as set FEC manually if autoneg is disabled.
It does not configure anything if autoneg is enabled (that said
few/no drivers currently reject .set_fecparam calls while autoneg
is disabled, hopefully FW will just ignore the settings).

The existing functionality is mostly preserved in the new API.
The ioctl interface uses a set of flags, and link modes to tell
user which modes are supported. Here is how the flags translate
to the new interface (skipping descriptions for actual FEC modes):

  ioctl flag      |   description         |  new API
================================================================
ETHTOOL_FEC_OFF   | disabled (supported)  | \
ETHTOOL_FEC_RS    |                       |  ` link mode bitset
ETHTOOL_FEC_BASER |                       |  / .._A_FEC_MODES
ETHTOOL_FEC_LLRS  |                       | /
ETHTOOL_FEC_AUTO  | pick based on cable   | bool .._A_FEC_AUTO
ETHTOOL_FEC_NONE  | not supported         | no bit, no AUTO reported

Since link modes are already depended on (although somewhat implicitly)
for expressing supported modes - the new interface uses them for
the manual configuration, as well as uses link mode bit number
to communicate the active mode.

Use of link modes allows us to define any number of FEC modes we want,
and reuse the strset we already have defined.

Separating AUTO as its own attribute is the biggest changed compared
to the ioctl. It means drivers can no longer report AUTO as the
active FEC mode because there is no link mode for AUTO.
active_fec == AUTO makes little sense in the first place IMHO,
active_fec should be the actual mode, so hopefully this is fine.

The other minor departure is that None is no longer explicitly
expressed in the API. But drivers are reasonable in handling of
this somewhat pointless bit, so I'm not expecting any issues there.

One extension which could be considered would be moving active FEC
to ETHTOOL_MSG_LINKMODE_*, but then why not move all of FEC into
link modes? I don't know where to draw the line.

netdevsim support and a simple self test are included.

Next step is adding stats similar to the ones added for pause.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

,
2021-03-31 14:15:23 -07:00
Jakub Kicinski
1da07e5db3 selftests: ethtool: add a netdevsim FEC test
Test FEC settings, iterate over configs.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:15:23 -07:00
Jakub Kicinski
0d7f76dc11 netdevsim: add FEC settings support
Add support for ethtool FEC and some ethtool error injection.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:15:23 -07:00
Jakub Kicinski
1e5d1f69d9 ethtool: support FEC settings over netlink
Add FEC API to netlink.

This is not a 1-to-1 conversion.

FEC settings already depend on link modes to tell user which
modes are supported. Take this further an use link modes for
manual configuration. Old struct ethtool_fecparam is still
used to talk to the drivers, so we need to translate back
and forth. We can revisit the internal API if number of FEC
encodings starts to grow.

Enforce only one active FEC bit (by using a bit position
rather than another mask).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:15:23 -07:00
Eric Lin
28110056f2 net: ethernet: Fix typo of 'network' in comment
Signed-off-by: Eric Lin <dslin1010@gmail.com>
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 18:14:11 -07:00
Ido Schimmel
7866f265b8 mlxsw: spectrum_router: Only perform atomic nexthop bucket replacement when requested
When cleared, the 'force' parameter in nexthop bucket replacement
notifications indicates that a driver should try to perform an atomic
replacement. Meaning, only update the contents of the bucket if it is
inactive.

Since mlxsw only queries buckets' activity once every second, there is
no point in trying an atomic replacement if the idle timer interval is
smaller than 1 second.

Currently, mlxsw ignores the original value of 'force' and will always
try an atomic replacement if the idle timer is not smaller than 1
second.

Fix this by taking the original value of 'force' into account and never
promoting a non-atomic replacement to an atomic one.

Fixes: 617a77f044 ("mlxsw: spectrum_router: Add nexthop bucket replacement support")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:51:21 -07:00
David S. Miller
65550f03e9 Merge branch 'mptcp-subflow-disconnected'
Mat Martineau says:

====================
MPTCP: Allow initial subflow to be disconnected

An MPTCP connection is aggregated from multiple TCP subflows, and can
involve multiple IP addresses on either peer. The addresses used in the
initial subflow connection are assigned address id 0 on each side of the
link. More addresses can be added and shared with the peer using address
IDs of 1 or larger. MPTCP in Linux shares non-zero address IDs across
all MPTCP connections in a net namespace, which allows userspace to
manage subflow connections across a number of sockets. However, this
makes the address with id 0 a special case, since the IP address
associated with id 0 is potentially different for each socket.

This patch set allows the initial subflow to be disconnected when
userspace specifies an address to remove using both id 0 and an IP
address, or when the peer sends an RM_ADDR for id 0.

Patches 1 and 3 implement the change for requests from the peer and
userspace, respectively.

Patch 2 consolidates some code for disconnecting subflows.

Patches 4-6 update the self tests to cover removal of subflows using
address id 0.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Geliang Tang
5e287fe761 selftests: mptcp: remove id 0 address testcases
This patch added the testcases for removing the id 0 subflow and the id 0
address.

In do_transfer, use the removing addresses number '9' for deleting the id
0 address.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Geliang Tang
2d121c9a88 selftests: mptcp: add addr argument for del_addr
For the id 0 address, different MPTCP connections could be using
different IP addresses for id 0.

This patch added an extra argument IP address for del_addr when
using id 0.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Matthieu Baerts
6254ad4088 selftests: mptcp: avoid calling pm_nl_ctl with bad IDs
IDs are supposed to be between 0 and 255.

In pm_nl_ctl, for both the 'add' and 'get' instruction, the ID is casted
in a u_int8_t. So if we give 256, we will delete ID 0. Obviously, the
goal is not to delete this ID by giving 256.

We could modify pm_nl_ctl and stop if the ID is negative or higher than
255 but probably better not to increase the number of lines for such
things in this tool which is only used in selftests. Instead, we use it
within the limits.

This modification also means that we will no longer add a new ID for the
2nd entry. That's why we removed an expected entry from the dump and
introduced with
commit dc8eb10e95 ("selftests: mptcp: add testcases for setting the address ID").

So now we delete ID 9 like before and we add entries for IDs 10 to 255
that are deleted just after.

Note that this could be seen as a fix but it was not really an issue so
far: we were simply playing with ID 0/1 once again. With the following
commit ("selftests: mptcp: add addr argument for del_addr"), it will be
different because ID 0 is going to required an address. We don't want
errors when trying to delete ID 0 without the address argument.

Acked-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Geliang Tang
740d798e87 mptcp: remove id 0 address
This patch added a new function mptcp_nl_remove_id_zero_address to
remove the id 0 address.

In this function, traverse all the existing msk sockets to find the
msk matched the input IP address. Then fill the removing list with
id 0, and pass it to mptcp_pm_remove_addr and mptcp_pm_remove_subflow.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Geliang Tang
9f12e97bf1 mptcp: unify RM_ADDR and RM_SUBFLOW receiving
There are some duplicate code in mptcp_pm_nl_rm_addr_received and
mptcp_pm_nl_rm_subflow_received. This patch unifies them into a new
function named mptcp_pm_nl_rm_addr_or_subflow. In it, use the input
parameter rm_type to identify it's now removing an address or a subflow.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Geliang Tang
774c8a8dcb mptcp: remove all subflows involving id 0 address
There's only one subflow involving the non-zero id address, but there
may be multi subflows involving the id 0 address.

Here's an example:

 local_id=0, remote_id=0
 local_id=1, remote_id=0
 local_id=0, remote_id=1

If the removing address id is 0, all the subflows involving the id 0
address need to be removed.

In mptcp_pm_nl_rm_addr_received/mptcp_pm_nl_rm_subflow_received, the
"break" prevents the iteration to the next subflow, so this patch
dropped them.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:42:23 -07:00
Eric Dumazet
b8128656a5 net: fix icmp_echo_enable_probe sysctl
sysctl_icmp_echo_enable_probe is an u8.

ipv4_net_table entry should use
 .maxlen       = sizeof(u8).
 .proc_handler = proc_dou8vec_minmax,

Fixes: f1b8fa9fa5 ("net: add sysctl for enabling RFC 8335 PROBE messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:38:43 -07:00
David S. Miller
3c7a83fa42 Merge branch 'ionic-cleanups'
Shannon Nelson says:

====================
ionic: code cleanup for heartbeat, dma error counts, sizeof, stats

These patches are a few more bits of code cleanup found in
testing and review: count all our dma error instances, make
better use of sizeof, fix a race in our device heartbeat check,
and clean up code formatting in the ethtool stats collection.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:37:13 -07:00
Shannon Nelson
aa620993b1 ionic: pull per-q stats work out of queue loops
Abstract out the per-queue data collection work into separate
functions from the per-queue loops in the stats reporting,
similar to what Alex did for the data label strings in
commit acebe5b610 ("ionic: Update driver to use ethtool_sprintf")

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:37:13 -07:00
Shannon Nelson
b2b9a8d7ed ionic: avoid races in ionic_heartbeat_check
Rework the heartbeat checks to be sure that we're getting an
atomic operation.  Through testing we found occasions where a
separate thread could clash with this check and cause erroneous
heartbeat check results.

Signed-off-by: Allen Hubbe <allenbh@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:37:12 -07:00
Shannon Nelson
230efff47a ionic: fix sizeof usage
Use the actual pointer that we care about as the subject of the
sizeof, rather than a struct name.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:37:12 -07:00