Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a small BPF helper similar to bpf_skb_load_bytes() that
is able to load relative to mac/net header offset from the skb's
linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
argument namely start_header, which is either BPF_HDR_START_MAC
or BPF_HDR_START_NET. This allows for a more flexible alternative
compared to LD_ABS/LD_IND with negative offset. It's enabled for
tc BPF programs as well as sock filter program types where it's
mainly useful in reuseport programs to ease access to lower header
data.
Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e75 ("x86: bpf_jit: support negative offsets")
* 35607b02db ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee94394 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf7 ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2c ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
is that the eBPF tests from test_bpf module do not go via BPF verifier
and therefore any instruction rewrites from verifier cannot take place.
Therefore, move them into test_verifier which runs out of user space,
so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches.
It will have the same effect since runtime tests are also performed from
there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
and keep it internal to core kernel.
Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
test_bpf.ko suite.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
No change in functionality, just remove the '__' prefix and replace it
with a 'bpf_' prefix instead. We later on add a couple of more helpers
for cBPF and keeping the scheme with '__' is suboptimal there.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The new dev_direct_xmit will be used by AF_XDP in later commits.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit wires up the xskmap to XDP_SKB layer.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit wires up the xskmap to XDP_DRV layer.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Here the actual receive functions of AF_XDP are implemented, that in a
later commit, will be called from the XDP layers.
There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).
A new XDP API, xdp_return_buff, is also introduced.
Adding xdp_return_buff, which is analogous to xdp_return_frame, but
acts upon an struct xdp_buff. The API will be used by AF_XDP in future
commits.
Support for the poll syscall is also implemented.
v2: xskq_validate_id did not update cons_tail.
The entries variable was calculated twice in xskq_nb_avail.
Squashed xdp_return_buff commit.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We do not require this inline function to be used in multiple different
locations, just inline it where it gets used in register_netdevice().
Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ethtool_get_rxnfc(), the object "info" is firstly copied from
user-space. If the FLOW_RSS flag is set in the member field flow_type of
"info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
user-space because FLOW_RSS is newer and has new definition, as mentioned
in the comment. However, given that the user data resides in user-space, a
malicious user can race to change the data after the first copy. By doing
so, the user can inject inconsistent data. For example, in the second
copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
In the following execution, "info" will be used in the function
ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
information leakage since ops->get_rxnfc() will prepare various types of
data according to flow_type, and the prepared data will be eventually
copied to user-space. This inconsistent data may also cause undefined
behaviors based on how ops->get_rxnfc() is implemented.
This patch simply re-verifies the flow_type field of "info" after the
second copy. If the value is not as expected, an error code will be
returned.
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a netdev feature to configure TLS TX offloads.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have about 53 netdev_features_t bits defined and counting, add a
build time check to catch when an u64 type will not be enough and we
will have to convert that to a bitmap. This is done in
register_netdevice() for convenience.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I am dropping the export of __skb_tx_hash as after my patches nobody is
using it outside of the net/core/dev.c file. In addition I am renaming and
repurposing it to just be a static declaration of skb_tx_hash since that
was the only user for it at this point. By doing this the compiler can
inline it into __netdev_pick_tx as that will improve performance.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild test robot says:
>coccinelle warnings: (new ones prefixed by >>)
>>> net/core/dev.c:1588:2-3: Unneeded semicolon
So, let's remove it.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new callback: get_ethtool_phy_stats() which allows network device
drivers not making use of the PHY library to return PHY statistics.
Update ethtool_get_phy_stats(), __ethtool_get_sset_count() and
__ethtool_get_strings() accordingly to interogate the network device
about ETH_SS_PHY_STATS.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make it possible for network device drivers that do not
necessarily have a phy_device attached, but still report PHY statistics,
have a preliminary refactoring consisting in creating helper functions
that encapsulate the PHY device driver knowledge within PHYLIB.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-27
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add extensive BPF helper description into include/uapi/linux/bpf.h
and a new script bpf_helpers_doc.py which allows for generating a
man page out of it. Thus, every helper in BPF now comes with proper
function signature, detailed description and return code explanation,
from Quentin.
2) Migrate the BPF collect metadata tunnel tests from BPF samples over
to the BPF selftests and further extend them with v6 vxlan, geneve
and ipip tests, simplify the ipip tests, improve documentation and
convert to bpf_ntoh*() / bpf_hton*() api, from William.
3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
access stack and packet memory. Extend this to allow such helpers
to also use map values, which enabled use cases where value from
a first lookup can be directly used as a key for a second lookup,
from Paul.
4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
order to retrieve XFRM state information containing SPI, peer
address and reqid values, from Eyal.
5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
and SUB instructions with negative immediate into the opposite
operation with a positive immediate such that nfp can better fit
small immediates into instructions. Savings in instruction count
up to 4% have been observed, from Jakub.
6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
and add support for dumping this through bpftool, from Jiri.
7) Move the BPF sockmap samples over into BPF selftests instead since
sockmap was rather a series of tests than sample anyway and this way
this can be run from automated bots, from John.
8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
with generic XDP, from Nikita.
9) Some follow-up cleanups to BTF, namely, removing unused defines from
BTF uapi header and renaming 'name' struct btf_* members into name_off
to make it more clear they are offsets into string section, from Martin.
10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
not run directly but invoked from test_sock_addr.sh, from Yonghong.
11) Remove redundant ret assignment in sample BPF loader, from Wang.
12) Add couple of missing files to BPF selftest's gitignore, from Anders.
There are two trivial merge conflicts while pulling:
1) Remove samples/sockmap/Makefile since all sockmap tests have been
moved to selftests.
2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
file since git should ignore all of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When bpf_adjust_tail was introduced for generic xdp, it changed skb's tail
pointer, so it was pointing to the new "end of the packet". However skb's
len field wasn't properly modified, so on the wire ethernet frame had
original (or even bigger, if adjust_head was used) size. This diff is
fixing this.
Fixes: 198d83bb3 (" bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail")
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Virtual devices such as tunnels and bonding can handle large packets.
Only segment packets when reaching a physical or loopback device.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.
UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.
The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).
IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.
[ Export __udp_gso_segment for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the check on FRA_L3MDEV attribute to helper to improve the
readability of fib_nl2rule. Update the extack messages to be
clear when the configuration option is disabled versus an invalid
value has been passed.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends NTF_EXT_LEARNED support to the neighbour system.
Example use-case: An Ethernet VPN implementation (eg in FRR routing suite)
can use this flag to add dynamic reachable external neigh entires
learned via control plane. The use of neigh NTF_EXT_LEARNED in this
patch is consistent with its use with bridge and vxlan fdb entries.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The percpu metadata_dst might carry the stale ip_tunnel_info
and cause incorrect behavior. When mixing tests using ipv4/ipv6
bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
ipv4's src ip addr as its ipv6 src address, because the previous
tunnel info does not clean up. The patch zeros the fields in
ip_tunnel_info.
Signed-off-by: William Tu <u9012063@gmail.com>
Reported-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.
Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions
The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.
struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.
Typical usage:
struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reduces code duplication in the fib rule add and del paths.
Get rid of validate_rulemsg. This became obvious when adding duplicate
extack support in fib newrule/delrule error paths.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Initial work on BPF Type Format (BTF) is added, which is a meta
data format which describes the data types of BPF programs / maps.
BTF has its roots from CTF (Compact C-Type format) with a number
of changes to it. First use case is to provide a generic pretty
print capability for BPF maps inspection, later work will also
add BTF to bpftool. pahole support to convert dwarf to BTF will
be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
from Martin.
2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
for changing the data_end pointer. Only shrinking is currently
supported which helps for crafting ICMP control messages. Minor
changes in drivers have been added where needed so they recalc
the packet's length also when data_end was adjusted, from Nikita.
3) Improve bpftool to make it easier to feed hex bytes via cmdline
for map operations, from Quentin.
4) Add support for various missing BPF prog types and attach types
that have been added to kernel recently but neither to bpftool
nor libbpf yet. Doc and bash completion updates have been added
as well for bpftool, from Andrey.
5) Proper fix for avoiding to leak info stored in frame data on page
reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
to move the pointers into struct xdp_frame area, from Jesper.
6) Follow-up compile fix from BTF in order to include stdbool.h in
libbpf, from Björn.
7) Few fixes in BPF sample code, that is, a typo on the netdevice
in a comment and fixup proper dump of XDP action code in the
tracepoint exception, from Wang and Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.
While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.
We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comment of dev_mc_init() is wrong. which use dev_mc_flush
instead of dev_mc_init.
Signed-off-by: Lianwen Sun <sunlw.fnst@cn.fujitsu.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6dfb970d3d ("xdp: avoid leaking info stored in frame data on
page reuse") tried to allow user/bpf_prog to (re)use area used by
xdp_frame (stored in frame headroom), by memset clearing area when
bpf_xdp_adjust_head give bpf_prog access to headroom area.
The mentioned commit had two bugs. (1) Didn't take bpf_xdp_adjust_meta
into account. (2) a combination of bpf_xdp_adjust_head calls, where
xdp->data is moved into xdp_frame section, can cause clearing
xdp_frame area again for area previously granted to bpf_prog.
After discussions with Daniel, we choose to implement a simpler
solution to the problem, which is to reserve the headroom used by
xdp_frame info.
This also avoids the situation where bpf_prog is allowed to adjust/add
headers, and then XDP_REDIRECT later drops the packet due to lack of
headroom for the xdp_frame. This would likely confuse the end-user.
Fixes: 6dfb970d3d ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch implement the 'Device Naming' feature of the Hyper-V
network device API. In Hyper-V on the host through the GUI or PowerShell
it is possible to enable the device naming feature which causes
the host to make available to the guest the name of the device.
This shows up in the RNDIS protocol as the friendly name.
The name has no particular meaning and is limited to 256 characters.
The value can only be set via PowerShell on the host, but could
be scripted for mass deployments. The default value is the
string 'Network Adapter' and since that is the same for all devices
and useless, the driver ignores it.
In Windows, the value goes into a registry key for use in SNMP
ifAlias. For Linux, this patch puts the value in the network
device alias property; where it is visible in ip tools and SNMP.
The host provided ifAlias is just a suggestion, and can be
overridden by later ip commands.
Also requires exporting dev_set_alias in netdev core.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Adding new bpf helper which would allow us to manipulate
xdp's data_end pointer, and allow us to reduce packet's size
indended use case: to generate ICMP messages from XDP context,
where such message would contain truncated original packet.
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.
To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
metrics are passed during route add. No need for a separate pmtu
entry; it can reference the MTU slot in fib6_metrics
- ip6_convert_metrics allocates memory in the FIB entry and uses
ip_metrics_convert to copy from netlink attribute to metrics entry
- the convert metrics call is done in ip6_route_info_create simplifying
the route add path
+ fib6_commit_metrics and fib6_copy_metrics and the temporary
mx6_config are no longer needed
- add fib6_metric_set helper to change the value of a metric in the
fib entry since dst_metric_set can no longer be used
- cow_metrics for IPv6 can drop to dst_cow_metrics_generic
- rt6_dst_from_metrics_check is no longer needed
- rt6_fill_node needs the FIB entry and dst as separate arguments to
keep compatibility with existing output. Current dst address is
renamed to dest.
(to be consistent with IPv4 rt6_fill_node really should be split
into 2 functions similar to fib_dump_info and rt_fill_info)
- rt6_fill_node no longer needs the temporary metrics variable
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to keep expires time for IPv6 routes in a dump of FIB entries.
Update rtnl_put_cacheinfo to allow dst to be NULL in which case
rta_cacheinfo will only contain non-dst data.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf infrastructure and verifier goes to great length to avoid
bpf progs leaking kernel (pointer) info.
For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
Checks are in place to assure enough headroom is available for this.
This info is not cleared, and if the frame is reused, then a
malicious user could use bpf_xdp_adjust_head helper to move
xdp->data into this area. Thus, making this area readable.
This is not super critical as XDP progs requires root or
CAP_SYS_ADMIN, which are privileged enough for such info. An
effort (is underway) towards moving networking bpf hooks to the
lesser privileged mode CAP_NET_ADMIN, where leaking such info
should be avoided. Thus, this patch to clear the info when
needed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.
This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed). For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.
As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call. This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).
V8: address issues found by kbuild test robot
- Address sparse should be static warnings
- Allow xdp.o to be compiled without page_pool.o
V9: Remove inline from .c file, compiler knows best
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.
Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page. The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.
[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes, don't return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
e.g. remove a WARN as pointed out by Tariq
e.g. quicker fallback if ptr_ring is empty.
V5: Fixed SPDX license as pointed out by Alexei
V6: Adjustments requested by Eric Dumazet
- Adjust ____cacheline_aligned_in_smp usage/placement
- Move rcu_head in struct page_pool
- Free pages quicker on destroy, minimize resources delayed an RCU period
- Remove code for forward/backward compat ABI interface
V8: Issues found by kbuild test robot
- Address sparse should be static warnings
- Only compile+link when a driver use/select page_pool,
mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.
The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.
For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.
V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch
V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>