Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-27
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add extensive BPF helper description into include/uapi/linux/bpf.h
and a new script bpf_helpers_doc.py which allows for generating a
man page out of it. Thus, every helper in BPF now comes with proper
function signature, detailed description and return code explanation,
from Quentin.
2) Migrate the BPF collect metadata tunnel tests from BPF samples over
to the BPF selftests and further extend them with v6 vxlan, geneve
and ipip tests, simplify the ipip tests, improve documentation and
convert to bpf_ntoh*() / bpf_hton*() api, from William.
3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
access stack and packet memory. Extend this to allow such helpers
to also use map values, which enabled use cases where value from
a first lookup can be directly used as a key for a second lookup,
from Paul.
4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
order to retrieve XFRM state information containing SPI, peer
address and reqid values, from Eyal.
5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
and SUB instructions with negative immediate into the opposite
operation with a positive immediate such that nfp can better fit
small immediates into instructions. Savings in instruction count
up to 4% have been observed, from Jakub.
6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
and add support for dumping this through bpftool, from Jiri.
7) Move the BPF sockmap samples over into BPF selftests instead since
sockmap was rather a series of tests than sample anyway and this way
this can be run from automated bots, from John.
8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
with generic XDP, from Nikita.
9) Some follow-up cleanups to BTF, namely, removing unused defines from
BTF uapi header and renaming 'name' struct btf_* members into name_off
to make it more clear they are offsets into string section, from Martin.
10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
not run directly but invoked from test_sock_addr.sh, from Yonghong.
11) Remove redundant ret assignment in sample BPF loader, from Wang.
12) Add couple of missing files to BPF selftest's gitignore, from Anders.
There are two trivial merge conflicts while pulling:
1) Remove samples/sockmap/Makefile since all sockmap tests have been
moved to selftests.
2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
file since git should ignore all of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When bpf_adjust_tail was introduced for generic xdp, it changed skb's tail
pointer, so it was pointing to the new "end of the packet". However skb's
len field wasn't properly modified, so on the wire ethernet frame had
original (or even bigger, if adjust_head was used) size. This diff is
fixing this.
Fixes: 198d83bb3 (" bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail")
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Virtual devices such as tunnels and bonding can handle large packets.
Only segment packets when reaching a physical or loopback device.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.
UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.
The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).
IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.
[ Export __udp_gso_segment for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the check on FRA_L3MDEV attribute to helper to improve the
readability of fib_nl2rule. Update the extack messages to be
clear when the configuration option is disabled versus an invalid
value has been passed.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends NTF_EXT_LEARNED support to the neighbour system.
Example use-case: An Ethernet VPN implementation (eg in FRR routing suite)
can use this flag to add dynamic reachable external neigh entires
learned via control plane. The use of neigh NTF_EXT_LEARNED in this
patch is consistent with its use with bridge and vxlan fdb entries.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The percpu metadata_dst might carry the stale ip_tunnel_info
and cause incorrect behavior. When mixing tests using ipv4/ipv6
bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
ipv4's src ip addr as its ipv6 src address, because the previous
tunnel info does not clean up. The patch zeros the fields in
ip_tunnel_info.
Signed-off-by: William Tu <u9012063@gmail.com>
Reported-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.
Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions
The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.
struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.
Typical usage:
struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reduces code duplication in the fib rule add and del paths.
Get rid of validate_rulemsg. This became obvious when adding duplicate
extack support in fib newrule/delrule error paths.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Initial work on BPF Type Format (BTF) is added, which is a meta
data format which describes the data types of BPF programs / maps.
BTF has its roots from CTF (Compact C-Type format) with a number
of changes to it. First use case is to provide a generic pretty
print capability for BPF maps inspection, later work will also
add BTF to bpftool. pahole support to convert dwarf to BTF will
be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
from Martin.
2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
for changing the data_end pointer. Only shrinking is currently
supported which helps for crafting ICMP control messages. Minor
changes in drivers have been added where needed so they recalc
the packet's length also when data_end was adjusted, from Nikita.
3) Improve bpftool to make it easier to feed hex bytes via cmdline
for map operations, from Quentin.
4) Add support for various missing BPF prog types and attach types
that have been added to kernel recently but neither to bpftool
nor libbpf yet. Doc and bash completion updates have been added
as well for bpftool, from Andrey.
5) Proper fix for avoiding to leak info stored in frame data on page
reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
to move the pointers into struct xdp_frame area, from Jesper.
6) Follow-up compile fix from BTF in order to include stdbool.h in
libbpf, from Björn.
7) Few fixes in BPF sample code, that is, a typo on the netdevice
in a comment and fixup proper dump of XDP action code in the
tracepoint exception, from Wang and Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.
While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.
We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comment of dev_mc_init() is wrong. which use dev_mc_flush
instead of dev_mc_init.
Signed-off-by: Lianwen Sun <sunlw.fnst@cn.fujitsu.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6dfb970d3d ("xdp: avoid leaking info stored in frame data on
page reuse") tried to allow user/bpf_prog to (re)use area used by
xdp_frame (stored in frame headroom), by memset clearing area when
bpf_xdp_adjust_head give bpf_prog access to headroom area.
The mentioned commit had two bugs. (1) Didn't take bpf_xdp_adjust_meta
into account. (2) a combination of bpf_xdp_adjust_head calls, where
xdp->data is moved into xdp_frame section, can cause clearing
xdp_frame area again for area previously granted to bpf_prog.
After discussions with Daniel, we choose to implement a simpler
solution to the problem, which is to reserve the headroom used by
xdp_frame info.
This also avoids the situation where bpf_prog is allowed to adjust/add
headers, and then XDP_REDIRECT later drops the packet due to lack of
headroom for the xdp_frame. This would likely confuse the end-user.
Fixes: 6dfb970d3d ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch implement the 'Device Naming' feature of the Hyper-V
network device API. In Hyper-V on the host through the GUI or PowerShell
it is possible to enable the device naming feature which causes
the host to make available to the guest the name of the device.
This shows up in the RNDIS protocol as the friendly name.
The name has no particular meaning and is limited to 256 characters.
The value can only be set via PowerShell on the host, but could
be scripted for mass deployments. The default value is the
string 'Network Adapter' and since that is the same for all devices
and useless, the driver ignores it.
In Windows, the value goes into a registry key for use in SNMP
ifAlias. For Linux, this patch puts the value in the network
device alias property; where it is visible in ip tools and SNMP.
The host provided ifAlias is just a suggestion, and can be
overridden by later ip commands.
Also requires exporting dev_set_alias in netdev core.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Adding new bpf helper which would allow us to manipulate
xdp's data_end pointer, and allow us to reduce packet's size
indended use case: to generate ICMP messages from XDP context,
where such message would contain truncated original packet.
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.
To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
metrics are passed during route add. No need for a separate pmtu
entry; it can reference the MTU slot in fib6_metrics
- ip6_convert_metrics allocates memory in the FIB entry and uses
ip_metrics_convert to copy from netlink attribute to metrics entry
- the convert metrics call is done in ip6_route_info_create simplifying
the route add path
+ fib6_commit_metrics and fib6_copy_metrics and the temporary
mx6_config are no longer needed
- add fib6_metric_set helper to change the value of a metric in the
fib entry since dst_metric_set can no longer be used
- cow_metrics for IPv6 can drop to dst_cow_metrics_generic
- rt6_dst_from_metrics_check is no longer needed
- rt6_fill_node needs the FIB entry and dst as separate arguments to
keep compatibility with existing output. Current dst address is
renamed to dest.
(to be consistent with IPv4 rt6_fill_node really should be split
into 2 functions similar to fib_dump_info and rt_fill_info)
- rt6_fill_node no longer needs the temporary metrics variable
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to keep expires time for IPv6 routes in a dump of FIB entries.
Update rtnl_put_cacheinfo to allow dst to be NULL in which case
rta_cacheinfo will only contain non-dst data.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf infrastructure and verifier goes to great length to avoid
bpf progs leaking kernel (pointer) info.
For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
Checks are in place to assure enough headroom is available for this.
This info is not cleared, and if the frame is reused, then a
malicious user could use bpf_xdp_adjust_head helper to move
xdp->data into this area. Thus, making this area readable.
This is not super critical as XDP progs requires root or
CAP_SYS_ADMIN, which are privileged enough for such info. An
effort (is underway) towards moving networking bpf hooks to the
lesser privileged mode CAP_NET_ADMIN, where leaking such info
should be avoided. Thus, this patch to clear the info when
needed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.
This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed). For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.
As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call. This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).
V8: address issues found by kbuild test robot
- Address sparse should be static warnings
- Allow xdp.o to be compiled without page_pool.o
V9: Remove inline from .c file, compiler knows best
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.
Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page. The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.
[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes, don't return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
e.g. remove a WARN as pointed out by Tariq
e.g. quicker fallback if ptr_ring is empty.
V5: Fixed SPDX license as pointed out by Alexei
V6: Adjustments requested by Eric Dumazet
- Adjust ____cacheline_aligned_in_smp usage/placement
- Move rcu_head in struct page_pool
- Free pages quicker on destroy, minimize resources delayed an RCU period
- Remove code for forward/backward compat ABI interface
V8: Issues found by kbuild test robot
- Address sparse should be static warnings
- Only compile+link when a driver use/select page_pool,
mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.
The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.
For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.
V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch
V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.
V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications might use SO_RCVLOWAT on TCP socket hoping to receive
one [E]POLLIN event only when a given amount of bytes are ready in socket
receive queue.
Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.
Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
neigh_ifdown() is called with &nd_tbl, locking this while
clearing the proxy neighbor entries when eg. deleting an
interface. Calling the table's pndisc_destructor() with the
lock still held, however, can cause a deadlock: When a
multicast listener is available an IGMP packet of type
ICMPV6_MGM_REDUCTION may be sent out. When reaching
ip6_finish_output2(), if no neighbor entry for the target
address is found, __neigh_create() is called with &nd_tbl,
which it'll want to lock.
Move the elements into their own list, then unlock the table
and perform the destruction.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
Fixes: 6fd6ce2056 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) The sockmap code has to free socket memory on close if there is
corked data, from John Fastabend.
2) Tunnel names coming from userspace need to be length validated. From
Eric Dumazet.
3) arp_filter() has to take VRFs properly into account, from Miguel
Fadon Perlines.
4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.
5) Missing idr_remove() in u32_delete_key(), from Cong Wang.
6) More syzbot stuff. Several use of uninitialized value fixes all
over, from Eric Dumazet.
7) Do not leak kernel memory to userspace in sctp, also from Eric
Dumazet.
8) Discard frames from unused ports in DSA, from Andrew Lunn.
9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
Falcon.
10) Do not access dp83640 PHY registers prematurely after reset, from
Esben Haabendal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vhost-net: set packet weight of tx polling to 2 * vq size
net: thunderx: rework mac addresses list to u64 array
inetpeer: fix uninit-value in inet_getpeer
dp83640: Ensure against premature access to PHY registers after reset
devlink: convert occ_get op to separate registration
ARM: dts: ls1021a: Specify TBIPA register address
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
ibmvnic: Do not reset CRQ for Mobility driver resets
ibmvnic: Fix failover case for non-redundant configuration
ibmvnic: Fix reset scheduler error handling
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix DMA mapping mistakes
tipc: use the right skb in tipc_sk_fill_sock_diag()
sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
net: dsa: Discard frames from unused ports
sctp: do not leak kernel memory to user space
soreuseport: initialise timewait reuseport field
ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
dccp: initialize ireq->ir_mark
net: fix uninit-value in __hw_addr_add_ex()
...
This resolves race during initialization where the resources with
ops are registered before driver and the structures used by occ_get
op is initialized. So keep occ_get callbacks registered only when
all structs are initialized.
The example flows, as it is in mlxsw:
1) driver load/asic probe:
mlxsw_core
-> mlxsw_sp_resources_register
-> mlxsw_sp_kvdl_resources_register
-> devlink_resource_register IDX
mlxsw_spectrum
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
2) reload triggered by devlink command:
-> mlxsw_devlink_core_bus_device_reload
-> mlxsw_sp_fini
-> mlxsw_sp_kvdl_fini
-> devlink_resource_occ_get_unregister IDX
(struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
which is using mlxsw_sp would cause use-after free)
-> mlxsw_sp_init
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
Fixes: d9f9b9a4d0 ("devlink: Add support for resource abstraction")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported __skb_try_recv_from_queue() was using skb->peeked
while it was potentially unitialized.
We need to clear it in __skb_clone()
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- the v9fs maintainers have been missing for a long time. I've taken
over v9fs patch slinging.
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
mm,oom_reaper: check for MMF_OOM_SKIP before complaining
mm/ksm: fix interaction with THP
mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
headers: untangle kmemleak.h from mm.h
include/linux/mmdebug.h: make VM_WARN* non-rvals
mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
mm: change return type to vm_fault_t
mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
kernel/fork.c: detect early free of a live mm
mm: make counting of list_lru_one::nr_items lockless
mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
block_invalidatepage(): only release page if the full page was invalidated
mm: kernel-doc: add missing parameter descriptions
mm/swap.c: remove @cold parameter description for release_pages()
mm/nommu: remove description of alloc_vm_area
zram: drop max_zpage_size and use zs_huge_class_size()
zsmalloc: introduce zs_huge_class_size()
mm: fix races between swapoff and flush dcache
fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
...
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason. It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it. Also remove <linux/kmemleak.h> from source
files that do not use it.
This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
be good to run it through the 0day bot for other $ARCHes. I have
neither the horsepower nor the storage space for the other $ARCHes.
Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
for which patches are included here (in v2).
[ slab.h is the second most used header file after module.h; kernel.h is
right there with slab.h. There could be some minor error in the
counting due to some #includes having comments after them and I didn't
combine all of those. ]
[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
+Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
5J1axMgVH5LAsIEcEQVq
=RVhK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
We want to use dev_valid_name() to validate tunnel names,
so better use strnlen(name, IFNAMSIZ) than strlen(name) to make
sure to not upset KASAN.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethtool option enables TLS record offload on HW, user
configures the feature for netdev capable of Inline TLS.
This allows user to define custom sk_prot for Inline TLS sock
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-03-31
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add raw BPF tracepoint API in order to have a BPF program type that
can access kernel internal arguments of the tracepoints in their
raw form similar to kprobes based BPF programs. This infrastructure
also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
returns an anon-inode backed fd for the tracepoint object that allows
for automatic detach of the BPF program resp. unregistering of the
tracepoint probe on fd release, from Alexei.
2) Add new BPF cgroup hooks at bind() and connect() entry in order to
allow BPF programs to reject, inspect or modify user space passed
struct sockaddr, and as well a hook at post bind time once the port
has been allocated. They are used in FB's container management engine
for implementing policy, replacing fragile LD_PRELOAD wrapper
intercepting bind() and connect() calls that only works in limited
scenarios like glibc based apps but not for other runtimes in
containerized applications, from Andrey.
3) BPF_F_INGRESS flag support has been added to sockmap programs for
their redirect helper call bringing it in line with cls_bpf based
programs. Support is added for both variants of sockmap programs,
meaning for tx ULP hooks as well as recv skb hooks, from John.
4) Various improvements on BPF side for the nfp driver, besides others
this work adds BPF map update and delete helper call support from
the datapath, JITing of 32 and 64 bit XADD instructions as well as
offload support of bpf_get_prandom_u32() call. Initial implementation
of nfp packet cache has been tackled that optimizes memory access
(see merge commit for further details), from Jakub and Jiong.
5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
API has been done in order to prepare to use print_bpf_insn() soon
out of perf tool directly. This makes the print_bpf_insn() API more
generic and pushes the env into private data. bpftool is adjusted
as well with the print_bpf_insn() argument removal, from Jiri.
6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
Format). The latter will reuse the current BPF verifier log as
well, thus bpf_verifier_log() is further generalized, from Martin.
7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
and write support has been added in similar fashion to existing
IPv6 IPV6_TCLASS socket option we already have, from Nikita.
8) Fixes in recent sockmap scatterlist API usage, which did not use
sg_init_table() for initialization thus triggering a BUG_ON() in
scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
uses a small helper sg_init_marker() to properly handle the affected
cases, from Prashant.
9) Let the BPF core follow IDR code convention and therefore use the
idr_preload() and idr_preload_end() helpers, which would also help
idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
pressure, from Shaohua.
10) Last but not least, a spelling fix in an error message for the
BPF cookie UID helper under BPF sample code, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This function calls call_netdevice_notifier(), which also
may take net_rwsem. So, we can't use net_rwsem here.
This patch makes callers of this functions take pernet_ops_rwsem,
like register_netdevice_notifier() does. This will protect
the modifications of net_namespace_list, and allows notifiers
to take it (they won't have to care about context).
Since __rtnl_link_unregister() is used on module load
and unload (which are not frequent operations), this looks
for me better, than make all call_netdevice_notifier()
always executing in "protected net_namespace_list" context.
Also, this fixes the problem we had a deal in 328fbe747a
"Close race between {un, }register_netdevice_notifier and ...",
and guarantees __rtnl_link_unregister() does not skip
exitting net.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions take net_rwsem, while wireless_nlevent_flush()
also takes it. But down_read() can't be taken recursive,
because of rw_semaphore design, which prevents it to be occupied
by only readers forever.
Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(),
net list can't change, so these down_read()/up_read() can be removed.
Fixes: f0b07bb151 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.
Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.
Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.
== Implementation details ==
Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
See description of the problem in the initial patch of this patch set.
== The solution ==
The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.
It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.
Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
significantly;
* there is no use-case for port.
As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.
Support is added for IPv4 and IPv6, for TCP and UDP.
IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.
== Implementation notes ==
The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.
bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.
bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).
It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.
The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.
== Implementation notes ==
[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.
[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".
There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>