This patch adds pretty print capability for data with BTF type info.
The current usage is to allow pretty print for a BPF map.
The next few patches will allow a read() on a pinned map with BTF
type info for its key and value.
This patch uses the seq_printf() infra.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch checks a few things of struct's members:
1) It has a valid size (e.g. a "const void" is invalid)
2) A member's size (+ its member's offset) does not exceed
the containing struct's size.
3) The member's offset satisfies the alignment requirement
The above can only be done after the needs_resolve member's type
is resolved. Hence, the above is done together in
btf_struct_resolve().
Each possible member's type (e.g. int, enum, modifier...) implements
the check_member() ops which will be called from btf_struct_resolve().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
After collecting all btf_type in the first pass in an earlier patch,
the second pass (in this patch) can validate the reference types
(e.g. the referring type does exist and it does not refer to itself).
While checking the reference type, it also gathers other information (e.g.
the size of an array). This info will be useful in checking the
struct's members in a later patch. They will also be useful in doing
pretty print later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces BPF type Format (BTF).
BTF (BPF Type Format) is the meta data format which describes
the data types of BPF program/map. Hence, it basically focus
on the C programming language which the modern BPF is primary
using. The first use case is to provide a generic pretty print
capability for a BPF map.
BTF has its root from CTF (Compact C-Type format). To simplify
the handling of BTF data, BTF removes the differences between
small and big type/struct-member. Hence, BTF consistently uses u32
instead of supporting both "one u16" and "two u32 (+padding)" in
describing type and struct-member.
It also raises the number of types (and functions) limit
from 0x7fff to 0x7fffffff.
Due to the above changes, the format is not compatible to CTF.
Hence, BTF starts with a new BTF_MAGIC and version number.
This patch does the first verification pass to the BTF. The first
pass checks:
1. meta-data size (e.g. It does not go beyond the total btf's size)
2. name_offset is valid
3. Each BTF_KIND (e.g. int, enum, struct....) does its
own check of its meta-data.
Some other checks, like checking a struct's member is referring
to a valid type, can only be done in the second pass. The second
verification pass will be implemented in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 6dfb970d3d ("xdp: avoid leaking info stored in frame data on
page reuse") tried to allow user/bpf_prog to (re)use area used by
xdp_frame (stored in frame headroom), by memset clearing area when
bpf_xdp_adjust_head give bpf_prog access to headroom area.
The mentioned commit had two bugs. (1) Didn't take bpf_xdp_adjust_meta
into account. (2) a combination of bpf_xdp_adjust_head calls, where
xdp->data is moved into xdp_frame section, can cause clearing
xdp_frame area again for area previously granted to bpf_prog.
After discussions with Daniel, we choose to implement a simpler
solution to the problem, which is to reserve the headroom used by
xdp_frame info.
This also avoids the situation where bpf_prog is allowed to adjust/add
headers, and then XDP_REDIRECT later drops the packet due to lack of
headroom for the xdp_frame. This would likely confuse the end-user.
Fixes: 6dfb970d3d ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Nikita V. Shirokov says:
====================
In this patch series i'm add new bpf helper which allow to manupulate
xdp's data_end pointer. right now only "shrinking" (reduce packet's size
by moving pointer) is supported (and i see no use case for "growing").
Main use case for such helper is to be able to generate controll (ICMP)
messages from XDP context. such messages usually contains first N bytes
from original packets as a payload, and this is exactly what this helper
would allow us to do (see patch 3 for sample program, where we generate
ICMP "packet too big" message). This helper could be usefull for load
balancing applications where after additional encapsulation, resulting
packet could be bigger then interface MTU.
Aside from new helper this patch series contains minor changes in device
drivers (for ones which requires), so they would recal packet's length
not only when head pointer was adjusted, but if tail's one as well.
v2->v3:
* adding missed "signed off by" in v2
v1->v2:
* fixed kbuild warning
* made offset eq 0 invalid for xdp_bpf_adjust_tail
* splitted bpf_prog_test_run fix and selftests in sep commits
* added SPDX licence where applicable
* some reshuffling in patches order (tests now in the end)
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
adding bpf's sample program which is using bpf_xdp_adjust_tail helper
by generating ICMPv4 "packet to big" message if ingress packet's size is
bigger then 600 bytes
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
adding selftests for bpf_xdp_adjust_tail helper. in this synthetic test
we are testing that 1) if data_end < data helper will return EINVAL
2) for normal use case packet's length would be reduced.
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
after introduction of bpf_xdp_adjust_tail helper packet length
could be changed not only if xdp->data pointer has been changed
but xdp->data_end as well. making bpf_prog_test_run aware of this
possibility
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for virtio driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for tun driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
(in case if after xdp's prog run data_end pointer was adjusted)
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for nfp driver we will just calculate packet's length unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for cavium's thunder driver we will just calculate packet's length
unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for bnxt driver we will just calculate packet's length unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for mlx4 driver we will just calculate packet's length unconditionally
(the same way as it's already being done in mlx5)
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Adding new bpf helper which would allow us to manipulate
xdp's data_end pointer, and allow us to reduce packet's size
indended use case: to generate ICMP messages from XDP context,
where such message would contain truncated original packet.
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpftool uses hexadecimal values when it dumps map contents:
# bpftool map dump id 1337
key: ff 13 37 ff value: a1 b2 c3 d4 ff ff ff ff
Found 1 element
In order to lookup or update values with bpftool, the natural reflex is
then to copy and paste the values to the command line, and to try to run
something like:
# bpftool map update id 1337 key ff 13 37 ff \
value 00 00 00 00 00 00 1a 2b
Error: error parsing byte: ff
bpftool complains, because it uses strtoul() with a 0 base to parse the
bytes, and that without a "0x" prefix, the bytes are considered as
decimal values (or even octal if they start with "0").
To feed hexadecimal values instead, one needs to add "0x" prefixes
everywhere necessary:
# bpftool map update id 1337 key 0xff 0x13 0x37 0xff \
value 0 0 0 0 0 0 0x1a 0x2b
To make it easier to use hexadecimal values, add an optional "hex"
keyword to put after "key" or "value" to tell bpftool to consider the
digits as hexadecimal. We can now do:
# bpftool map update id 1337 key hex ff 13 37 ff \
value hex 0 0 0 0 0 0 1a 2b
Without the "hex" keyword, the bytes are still parsed according to
normal integer notation (decimal if no prefix, or hexadecimal or octal
if "0x" or "0" prefix is used, respectively).
The patch also add related documentation and bash completion for the
"hex" keyword.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Suggested-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The variable rec_i contains an XDP action code not an error.
Thus, using err2str() was wrong, it should have been action2str().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The program run against loopback interace "lo", not "eth0".
Correct the comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrey Ignatov says:
====================
v1->v2:
- add new types to bpftool-cgroup man page;
- add new types to bash completion for bpftool;
- don't add types that should not be in bpftool cgroup.
Add support for various BPF prog types and attach types that have been
added to kernel recently but not to bpftool or libbpf yet.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Add missing pieces for BPF_PROG_TYPE_RAW_TRACEPOINT in libbpf:
* is- and set- functions;
* support guessing prog type.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
libbpf can guess prog type and expected attach type based on section
name. Add hints for "cgroup/post_bind4" and "cgroup/post_bind6" section
names.
Existing "cgroup/sock" is not changed, i.e. expected_attach_type for it
is not set to `BPF_CGROUP_INET_SOCK_CREATE`, for backward compatibility.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Send a netlink notification when userspace adds a manually configured
address if DAD is enabled and optimistic flag isn't set.
Moreover send RTM_DELADDR notifications for tentative addresses.
Some userspace applications (e.g. NetworkManager) are interested in
addr netlink events albeit the address is still in tentative state,
however events are not sent if DAD process is not completed.
If the address is added and immediately removed userspace listeners
are not notified. This behaviour can be easily reproduced by using
veth interfaces:
$ ip -b - <<EOF
> link add dev vm1 type veth peer name vm2
> link set dev vm1 up
> link set dev vm2 up
> addr add 2001:db8:a🅱️1:2:3:4/64 dev vm1
> addr del 2001:db8:a🅱️1:2:3:4/64 dev vm1
EOF
This patch reverts the behaviour introduced by the commit f784ad3d79
("ipv6: do not send RTM_DELADDR for tentative addresses")
Suggested-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like tos inherit, ttl inherit should also means inherit the inner protocol's
ttl values, which actually not implemented in vxlan yet.
But we could not treat ttl == 0 as "use the inner TTL", because that would be
used also when the "ttl" option is not specified and that would be a behavior
change, and breaking real use cases.
So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is
specified with ip cmd.
Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI driver defines a generic ncsi_channel_filter struct that can be
used to store arbitrarily formatted filters, and several generic methods
of accessing data stored in such a filter.
However in both the driver and as defined in the NCSI specification
there are only two actual filters: VLAN ID filters and MAC address
filters. The splitting of the MAC filter into unicast, multicast, and
mixed is also technically not necessary as these are stored in the same
location in hardware.
To save complexity, particularly in the set up and accessing of these
generic filters, remove them in favour of two specific structs. These
can be acted on directly and do not need several generic helper
functions to use.
This also fixes a memory error found by KASAN on ARM32 (which is not
upstream yet), where response handlers accessing a filter's data field
could write past allocated memory.
[ 114.926512] ==================================================================
[ 114.933861] BUG: KASAN: slab-out-of-bounds in ncsi_configure_channel+0x4b8/0xc58
[ 114.941304] Read of size 2 at addr 94888558 by task kworker/0:2/546
[ 114.947593]
[ 114.949146] CPU: 0 PID: 546 Comm: kworker/0:2 Not tainted 4.16.0-rc6-00119-ge156398bfcad #13
...
[ 115.170233] The buggy address belongs to the object at 94888540
[ 115.170233] which belongs to the cache kmalloc-32 of size 32
[ 115.181917] The buggy address is located 24 bytes inside of
[ 115.181917] 32-byte region [94888540, 94888560)
[ 115.192115] The buggy address belongs to the page:
[ 115.196943] page:9eeac100 count:1 mapcount:0 mapping:94888000 index:0x94888fc1
[ 115.204200] flags: 0x100(slab)
[ 115.207330] raw: 00000100 94888000 94888fc1 0000003f 00000001 9eea2014 9eecaa74 96c003e0
[ 115.215444] page dumped because: kasan: bad access detected
[ 115.221036]
[ 115.222544] Memory state around the buggy address:
[ 115.227384] 94888400: fb fb fb fb fc fc fc fc 04 fc fc fc fc fc fc fc
[ 115.233959] 94888480: 00 00 00 fc fc fc fc fc 00 04 fc fc fc fc fc fc
[ 115.240529] >94888500: 00 00 04 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[ 115.247077] ^
[ 115.252523] 94888580: 00 04 fc fc fc fc fc fc 06 fc fc fc fc fc fc fc
[ 115.259093] 94888600: 00 00 06 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[ 115.265639] ==================================================================
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a dns_resolver key whose payload contains a very long option name
resulted in that string being printed in full. This hit the WARN_ONCE()
in set_precision() during the printk(), because printk() only supports a
precision of up to 32767 bytes:
precision 1000000 too large
WARNING: CPU: 0 PID: 752 at lib/vsprintf.c:2189 vsnprintf+0x4bc/0x5b0
Fix it by limiting option strings (combined name + value) to a much more
reasonable 128 bytes. The exact limit is arbitrary, but currently the
only recognized option is formatted as "dnserror=%lu" which fits well
within this limit.
Also ratelimit the printks.
Reproducer:
perl -e 'print "#", "A" x 1000000, "\x00"' | keyctl padd dns_resolver desc @s
This bug was found using syzkaller.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 4a2d789267 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
not invoked directly outside of af_inet6.c. Make it static and move
inet6_bind after to avoid forward declaration.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer says:
====================
XDP redirect memory return API
Submitted against net-next, as it contains NIC driver changes.
This patchset works towards supporting different XDP RX-ring memory
allocators. As this will be needed by the AF_XDP zero-copy mode.
The patchset uses mlx5 as the sample driver, which gets implemented
XDP_REDIRECT RX-mode, but not ndo_xdp_xmit (as this API is subject to
change thought the patchset).
A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.
Support for a driver supplied allocator is implemented, and a
refurbished version of page_pool is the first return allocator type
introduced. This will be a integration point for AF_XDP zero-copy.
The mlx5 driver evolve into using the page_pool, and see a performance
increase (with ndo_xdp_xmit out ixgbe driver) from 6Mpps to 12Mpps.
The patchset stop at 16 patches (one over limit), but more API changes
are planned. Specifically extending ndo_xdp_xmit and xdp_return_frame
APIs to support bulking. As this will address some known limits.
V2: Updated according to Tariq's feedback
V3: Updated based on feedback from Jason Wang and Alex Duyck
V4: Updated based on feedback from Tariq and Jason
V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
V6: Updated based on feedback from Eric Dumazet and Alex Duyck
V7: Adapt to i40e that got XDP_REDIRECT support in-between
V8:
Updated based on feedback kbuild test robot, and adjust for mlx5 changes
page_pool only compiled into kernel when drivers Kconfig 'select' feature
V9:
Remove some inline statements, let compiler decide what to inline
Fix return value in virtio_net driver
Adjust for mlx5 changes in-between submissions
V10:
Minor adjust for mlx5 requested by Tariq
Resubmit against net-next
V11: avoid leaking info stored in frame data on page reuse
====================
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf infrastructure and verifier goes to great length to avoid
bpf progs leaking kernel (pointer) info.
For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
Checks are in place to assure enough headroom is available for this.
This info is not cleared, and if the frame is reused, then a
malicious user could use bpf_xdp_adjust_head helper to move
xdp->data into this area. Thus, making this area readable.
This is not super critical as XDP progs requires root or
CAP_SYS_ADMIN, which are privileged enough for such info. An
effort (is underway) towards moving networking bpf hooks to the
lesser privileged mode CAP_NET_ADMIN, where leaking such info
should be avoided. Thus, this patch to clear the info when
needed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.
This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator. And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e892754 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a3 ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed). For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.
As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call. This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).
V8: address issues found by kbuild test robot
- Address sparse should be static warnings
- Allow xdp.o to be compiled without page_pool.o
V9: Remove inline from .c file, compiler knows best
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.
Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page. The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.
[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes, don't return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
e.g. remove a WARN as pointed out by Tariq
e.g. quicker fallback if ptr_ring is empty.
V5: Fixed SPDX license as pointed out by Alexei
V6: Adjustments requested by Eric Dumazet
- Adjust ____cacheline_aligned_in_smp usage/placement
- Move rcu_head in struct page_pool
- Free pages quicker on destroy, minimize resources delayed an RCU period
- Remove code for forward/backward compat ABI interface
V8: Issues found by kbuild test robot
- Address sparse should be static warnings
- Only compile+link when a driver use/select page_pool,
mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.
The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.
For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.
V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch
V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enable a different memory model, thus activating another code path
in the xdp_return_frame API.
V2: Fixed issues pointed out by Tariq.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also convert driver i40e, which very recently got XDP_REDIRECT support
in commit d9314c474d ("i40e: add support for XDP_REDIRECT").
V7: This patch got added in V7 of this patchset.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The generic xdp_frame format, was inspired by the cpumap own internal
xdp_pkt format. It is now time to convert it over to the generic
xdp_frame format. The cpumap needs one extra field dev_rx.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page). Thus, is only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve struct page.
Use the XDP return API to get away from such assumptions. Instead
queue an xdp_frame, which allow us to use the xdp_return_frame API,
when releasing the frame.
V8: Avoid endianness issues (found by kbuild test robot)
V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tuntap driver invented it's own driver specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.
Convert it over to use the more generic xdp_frame structure. The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trused/used when dequeueing the frame.
V3: Remove check based on feedback from Jason
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed to convert drivers tuntap and virtio_net.
This is a generalization of what is done inside cpumap, which will be
converted later.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done to prepare for the next patch, and it is also
nice to move this XDP related struct out of filter.h.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend struct ixgbe_tx_buffer to store the xdp_mem_info.
Notice that this could be optimized further by putting this into
a union in the struct ixgbe_tx_buffer, but this patchset
works towards removing this again. Thus, this is not done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.
V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements basic XDP redirect support in mlx5 driver.
Notice that the ndo_xdp_xmit() is NOT implemented, because that API
need some changes that this patchset is working towards.
The main purpose of this patch is have different drivers doing
XDP_REDIRECT to show how different memory models behave in a cross
driver world.
Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect,
as the return API does not exist yet to to keep this mapped.
Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce xdpsq.db.redirect_flush boolian.
V9: Adjust for commit 121e892754 ("net/mlx5e: Refactor RQ XDP_TX indication")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>