Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Use accessor functions for skb fragment's page_offset instead
of direct references, in preparation for bvec conversion.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The counter variable in i40e_clean_tx_irq starts out negative and climbs
to 0. So it should not be defined as a u16. This was working by accident
due to the fact the u16 overflows and underflows predictably.
Replace the u16 with int, which is signed and can handle the negativity.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
Update all users of eth_get_headlen to pass network device, fetch
network namespace from it and pass it down to the flow dissector.
This commit is a noop until administrator inserts BPF flow dissector
program.
Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
mmiowb() is now implied by spin_unlock() on architectures that require
it, so there is no reason to call it from driver code. This patch was
generated using coccinelle:
@mmiowb@
@@
- mmiowb();
and invoked as:
$ for d in drivers include/linux/qed sound; do \
spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done
NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation. If you've ended up bisecting to this commit, you can
reintroduce the mmiowb() calls using wmb() instead, which should restore
the old behaviour on all architectures other than some esoteric ia64
systems.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
There are two reasons for this.
First, the xmit_more flag conceptually doesn't fit into the skb, as
xmit_more is not a property related to the skb.
Its only a hint to the driver that the stack is about to transmit another
packet immediately.
Second, it was only done this way to not have to pass another argument
to ndo_start_xmit().
We can place xmit_more in the softnet data, next to the device recursion.
The recursion counter is already written to on each transmit. The "more"
indicator is placed right next to it.
Drivers can use the netdev_xmit_more() helper instead of skb->xmit_more
to check the "more packets coming" hint.
skb->xmit_more is retained (but always 0) to not cause build breakage.
This change takes care of the simple s/skb->xmit_more/netdev_xmit_more()/
conversions. Remaining drivers are converted in the next patches.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the driver clears the XDP xmit ring due to re-configuration or
teardown, in-progress ndo_xdp_xmit must be taken into consideration.
The ndo_xdp_xmit function is typically called from a NAPI context that
the driver does not control. Therefore, we must be careful not to
clear the XDP ring, while the call is on-going. This patch adds a
synchronize_rcu() to wait for napi(s) (preempt-disable regions and
softirqs), prior clearing the queue. Further, the __I40E_CONFIG_BUSY
flag is checked in the ndo_xdp_xmit implementation to avoid touching
the XDP xmit queue during re-configuration.
Fixes: d9314c474d ("i40e: add support for XDP_REDIRECT")
Fixes: 123cecd427 ("i40e: added queue pair disable/enable functions")
Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Move rx_ptype extracting to i40e_process_skb_fields() to avoid
duplicating the code.
Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This fixes two bugs in hardware VLAN offload:
1. VLAN.TCI == 0 was being dropped
2. there was a race between disabling of VLAN RX feature in hardware
and processing RX queue, where packets processed in this window
could have their VLAN information dropped
Fix moves the VLAN handling into i40e_process_skb_fields() to save on
duplicated code. i40e_receive_skb() becomes trivial and so is removed.
Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
While reviewing code, I noticed that Eric Dumazet recommends that
drivers check the return code of napi_complete_done, and use that
to decide to enable interrupts or not when exiting poll. One of
the Intel drivers was already fixed (ixgbe).
Upon looking at the Intel drivers as a whole, we are handling our
polling and NAPI exit in a few different ways based on whether we
have multiqueue and whether we have Tx cleanup included. Several
drivers had the bug of exiting NAPI with return 0, which appears
to mess up the accounting in the stack.
Consolidate all the NAPI routines to do best known way of exiting
and to just mostly look like each other.
1) check return code of napi_complete_done to control interrupt enable
2) return the actual amount of work done.
3) return budget immediately if need NAPI poll again
Tested the changes on e1000e with a high interrupt rate set, and
it shows about an 8% reduction in the CPU utilization when busy
polling because we aren't re-enabling interrupts when we're about
to be polled.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Use a local variable to make the code a bit more readable.
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Many of the Intel Ethernet drivers call skb_tx_timestamp() earlier than
necessary. Move the calls to this function to the latest point possible,
just prior to notifying hardware of the new Tx packet when we bump the
tail register.
This affects i40e, iavf, igb, igc, and ixgbe.
The e100, e1000, e1000e, fm10k, and ice drivers already call the
skb_tx_timestamp() function just prior to indicating the Tx packet to
hardware, so they do not need to be changed.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Drivers using generic NAPI interface no longer need to include
<net/busy_poll.h>, since busy polling was moved to core networking
stack long ago.
See commit 79e7fff47b ("net: remove support for per driver
ndo_busy_poll()") for reference.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings comes up again, entries from the stash are
used to re-populate the ring.
The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast, and a slow
allocation. The "fast allocation" is used from the fast-path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.
The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch adds zero-copy Tx support for AF_XDP sockets. It implements
the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
NAPI context. This means pulling egress packets from the Tx ring,
placing the frames on the NIC HW descriptor ring and completing sent
frames back to the application via the completion ring.
The regular XDP Tx ring is used for AF_XDP as well. This rationale for
this is as follows: XDP_REDIRECT guarantees mutual exclusion between
different NAPI contexts based on CPU id. In other words, a netdev can
XDP_REDIRECT to another netdev with a different NAPI context, since
the operation is bound to a specific core and each core has its own
hardware ring.
As the AF_XDP Tx action is running in the same NAPI context and using
the same ring, it will also be protected from XDP_REDIRECT actions
with the exact same mechanism.
As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
i40e_xsk.c.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch prepares for the upcoming zero-copy Tx functionality, by
moving common functions and refactor chunks of code into re-usable
functions, used both by the regular path and zero-copy path.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
queue.
All AF_XDP specific functions are added to a new file, i40e_xsk.c.
Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
will allocate a new buffer and copy the zero-copy frame prior passing
it to the kernel stack.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch prepares for the upcoming zero-copy Rx functionality, by
moving/changing linkage of common functions, used both by the regular
path and zero-copy path.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In this commit, the Rx path is refactored some, as a step torwards the
introduction AF_XDP Rx zero-copy.
The page re-use counter is moved into the i40e_reuse_rx_page, instead
of bumping the counter in many places. The Rx buffer page clearing is
moved for better readability. Lastely, functions to update statistics
and bump the XDP Tx ring are introduced.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 114791 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver was combining the XDP_TX tail flush and XDP_REDIRECT
map flushing (xdp_do_flush_map). This is suboptimal, these two
flush operations should be kept separate.
It looks like the mistake was copy-pasted from ixgbe.
Fixes: d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
is clearly wrong since I40E_SKB_PAD already points to the offset where
the original xdp->data was sitting since xdp->data_hard_start is defined
as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
when build skb is used.
However, also before cc5b114dcf ("bpf, i40e: add meta data support")
this seems broken since bpf_xdp_adjust_head() helper could have been used
to alter headroom and enlarge / shrink the frame and with that the assumption
that the xdp->data remains unchanged does not hold and would push a bogus
packet to upper stack.
ixgbe got this right in 9247080816 ("ixgbe: add XDP support for pass and
drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
skb_reserve() and truesize calculation.
Fixes: cc5b114dcf ("bpf, i40e: add meta data support")
Fixes: 0c8493d90b ("i40e: add XDP support for pass and drop actions")
Reported-by: Keith Busch <keith.busch@linux.intel.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Tested-by: Keith Busch <keith.busch@linux.intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-06-05
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
connect: "This allows to override source IP (including the case when it's
set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
TCP and connected UDP (fast path) are not affected. This makes UDP support
complete, that is, connected UDP is handled by connect hooks, unconnected
by sendmsg ones.", from Andrey.
2) Rework of the AF_XDP API to allow extending it in future for type writer
model if necessary. In this mode a memory window is passed to hardware
and multiple frames might be filled into that window instead of just one
that is the case in the current fixed frame-size model. With the new
changes made this can be supported without having to add a new descriptor
format. Also, core bits for the zero-copy support for AF_XDP have been
merged as agreed upon, where i40e bits will be routed via Jeff later on.
Various improvements to documentation and sample programs included as
well, all from Björn and Magnus.
3) Given BPF's flexibility, a new program type has been added to implement
infrared decoders. Quote: "The kernel IR decoders support the most
widely used IR protocols, but there are many protocols which are not
supported. [...] There is a 'long tail' of unsupported IR protocols,
for which lircd is need to decode the IR. IR encoding is done in such
a way that some simple circuit can decode it; therefore, BPF is ideal.
[...] user-space can define a decoder in BPF, attach it to the rc
device through the lirc chardev.", from Sean.
4) Several improvements and fixes to BPF core, among others, dumping map
and prog IDs into fdinfo which is a straight forward way to correlate
BPF objects used by applications, removing an indirect call and therefore
retpoline in all map lookup/update/delete calls by invoking the callback
directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
for tc BPF programs to have an efficient way of looking up cgroup v2 id
for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
state that hasn't been filled, to allow context access wrt pt_regs in
32 bit archs for tracing, and last but not least various test cases
for fixes that landed in bpf earlier, from Daniel.
5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.
6) Add a new bpf_get_current_cgroup_id() helper that can be used in
tracing to retrieve the cgroup id from the current process in order
to allow for e.g. aggregation of container-level events, from Yonghong.
7) Two follow-up fixes for BTF to reject invalid input values and
related to that also two test cases for BPF kselftests, from Martin.
8) Various API improvements to the bpf_fib_lookup() helper, that is,
dropping MPLS bits which are not fully hashed out yet, rejecting
invalid helper flags, returning error for unsupported address
families as well as renaming flowlabel to flowinfo, from David.
9) Various fixes and improvements to sockmap BPF kselftests in particular
in proper error detection and data verification, from Prashant.
10) Two arm32 BPF JIT improvements. One is to fix imm range check with
regards to whether immediate fits into 24 bits, and a naming cleanup
to get functions related to rsh handling consistent to those handling
lsh, from Wang.
11) Two compile warning fixes in BPF, one for BTF and a false positive
to silent gcc in stack_map_get_build_id_offset(), from Arnd.
12) Add missing seg6.h header into tools include infrastructure in order
to fix compilation of BPF kselftests, from Mathieu.
13) Several formatting cleanups in the BPF UAPI helper description that
also fix an error during rst2man compilation, from Quentin.
14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
not built into the kernel, from Yue.
15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the ndo_xdp_flush call implementation i40e_xdp_flush
as no callers of ndo_xdp_flush are left.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for XDP meta data when using build skb variant of
the i40e driver. Implementation is analogous to the existing
ixgbe and ixgbevf support for meta data from 366a88fe2f ("bpf,
ixgbe: add meta data support") and be8333322e ("ixgbevf: Add
support for meta data"). With the build skb variant we get
192 bytes of extra headroom which can be used for encaps or
meta data.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When passed the XDP_XMIT_FLUSH flag i40e_xdp_xmit now performs the
same kind of ring tail update as in i40e_xdp_flush. The advantage is
that all the necessary checks have been performed and xdp_ring can be
updated, instead of having to perform the exact same steps/checks in
i40e_xdp_flush
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.
The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch change the API for ndo_xdp_xmit to support bulking
xdp_frames.
When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.
Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.
Testing without CONFIG_RETPOLINE show the same performance for
physical NIC drivers.
The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.
V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recent versions of the Linux kernel now warn about incorrect parameter
definitions for function comments. Fix up several function comments to
correctly reflect the current function arguments. This cleans up the
warnings and helps ensure our documentation is accurate.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
After many years of having a ~30 line copyright and license header to our
source files, we are finally able to reduce that to one line with the
advent of the SPDX identifier.
Also caught a few files missing the SPDX license identifier, so fixed
them up.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.
This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also convert driver i40e, which very recently got XDP_REDIRECT support
in commit d9314c474d ("i40e: add support for XDP_REDIRECT").
V7: This patch got added in V7 of this patchset.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver now acts upon the XDP_REDIRECT return action. Two new ndos
are implemented, ndo_xdp_xmit and ndo_xdp_flush.
XDP_REDIRECT action enables XDP program to redirect frames to other
netdevs.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This commit tweaks the page counting for XDP_REDIRECT to function
properly. XDP_REDIRECT support will be added in a future commit.
The current page counting scheme assumes that the reference count
cannot decrease until the received frame is sent to the upper layers
of the networking stack. This assumption does not hold for the
XDP_REDIRECT action, since a page (pointed out by xdp_buff) can have
its reference count decreased via the xdp_do_redirect call.
To work around that, we now start off by a large page count and then
don't allow a refcount less than two.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The two Flow Directory auto disable flags are used at run time to mark
when the flow director features needed to be disabled. Thus the flags
could change even when the RTNL lock is not held.
They also have some code constructions which really should be
test_and_set or test_and_clear using atomic bit operations.
Create new state fields to mark this, and stop including them in
pf->flags.
This is part of a larger effort to remove the need for cmpxchg64 in
i40e_set_priv_flags().
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add the SPDX identifiers to all the Intel wired LAN driver files, as
outlined in Documentation/process/license-rules.rst.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The i40e_detect_recover_hung function uses the i40e_get_tx_pending
function to determine if there are packets stalled on the ring.
i40e_get_tx_pending calculates the pending packets using the head
writeback value and HW tail. If the queue is stopped and we lose the
interrupt to update our next_to_clean then we a) won't get another
interrupt to clean because queue is stopped b) we won't catch the
problem with i40e_detect_recover_hung because the HW values look like
there's no packets waiting to be transmitted. Using the SW values we
can catch the issue because next_to_clean will be out of sync with head
writeback.
This has the added benefit being less CPU intensive because we don't
need to reach into the hardware to get the values.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch replaces the existing mechanism for determining the correct
value to program for adaptive ITR with yet another new and more
complicated approach.
The basic idea from a 30K foot view is that this new approach will push the
Rx interrupt moderation up so that by default it starts in low latency and
is gradually pushed up into a higher latency setup as long as doing so
increases the number of packets processed, if the number of packets drops
to 4 to 1 per packet we will reset and just base our ITR on the size of the
packets being received. For Tx we leave it floating at a high interrupt
delay and do not pull it down unless we start processing more than 112
packets per interrupt. If we start exceeding that we will cut our interrupt
rates in half until we are back below 112.
The side effect of these patches are that we will be processing more
packets per interrupt. This is both a good and a bad thing as it means we
will not be blocking processing in the case of things like pktgen and XDP,
but we will also be consuming a bit more CPU in the cases of things such as
network throughput tests using netperf.
One delta from this versus the ixgbe version of the changes is that I have
made the interrupt moderation a bit more aggressive when we are in bulk
mode by moving our "goldilocks zone" up from 48 to 96 to 56 to 112. The
main motivation behind moving this is to address the fact that we need to
update less frequently, and have more fine grained control due to the
separate Tx and Rx ITR times.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch is mostly prep-work for replacing the current approach to
programming the dynamic aka adaptive ITR. Specifically here what we are
doing is splitting the Tx and Rx ITR each into two separate values.
The first value current_itr represents the current value of the register.
The second value target_itr represents the desired value of the register.
The general plan by doing this is to allow for deferring the update of the
ITR value under certain circumstances. For now we will work with what we
have, but in the future I hope to change the behavior so that we always
only update one ITR at a time using some simple logic to determine which
ITR requires an update.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Instead of using the register value for the defines when setting up the
ring ITR we can just use the actual values and avoid the use of shifts and
macros to translate between the values we have and the values we want.
This helps to make the code more readable as we can quickly translate from
one value to the other.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The CLEARPBA bit in the dynamic interrupt control register actually has
no effect either way on the hardware. As per errata 28 in the XL710
specification update the interrupt is actually cleared any time the
register is written with the INTENA_MSK bit set to 0. As such the act of
toggling the enable bit actually will trigger the interrupt being
cleared and could lead to potential lost events if auto-masking is
not enabled.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The logic for dynamic ITR update is confusing at best as there were odd
paths chosen for how to find the rings associated with a given queue based
on the vector index and other inconsistencies throughout the code.
This patch is an attempt to clean up the logic so that we can more easily
understand what is going on. Specifically if there is a Rx or Tx ring that
is enabled in dynamic mode on the q_vector it is allowed to override the
other side of the interrupt moderation. While it isn't correct all this
patch is doing is cleaning up the logic for now so that when we come
through and fix it we can more easily identify that this is wrong.
The other big change made here is that we replace references to:
vsi->rx_rings[q_vector->v_idx]->itr_setting
with:
q_vector->rx.ring->itr_setting
The general idea is we can avoid the long pointer chase since just
accessing q_vector->rx.ring is a single pointer access versus having to
chase down vsi->rx_rings, and then finding the pointer in the array, and
finally chasing down the itr_setting from there.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The rings are already split out into Tx and Rx rings so it doesn't make
sense to have any single ring store both a Tx and Rx itr_setting value.
Since that is the case drop the pair in favor of storing just a single ITR
value.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
'bufer' should be 'buffer'
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The drivers for i40e and i40evf had a reg_idx value stored in the q_vector
that was going completely unused. I can only assume this was copied over
from ixgbe and nobody knew how to use it.
I'm going to make use of the value to avoid having to compute the vector
and thus the register index for multiple paths throughout the drivers.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>