This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.
The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
In preparation for adding XDP support to the bonding driver
refactor the packet hashing functions to be able to work with
any linear data buffer without an skb.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com
Simon Horman says:
====================
This short series provides minor enhancements to the
sample code in samples/bpf/xdpsock_user.c.
Each change is explained more fully in its own commit message.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
There is a forward declaration of ip_fast_csum() just before its
implementation, remove the unneeded forward declaration.
While at it mark the implementation as static inline.
Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://lore.kernel.org/bpf/20210806122855.26115-3-simon.horman@corigine.com
The xdpsock sample application is a useful base for experiment's around
AF_XDP sockets. Compiling the sample outside of the kernel tree is made
harder then it has to be as the sample includes two headers and that are
not installed by 'make install_header' nor are usually part of
distributions kernel headers.
The first header asm/barrier.h is not used and can just be dropped.
The second linux/compiler.h are only needed for the decorator __force
and are only used in ip_fast_csum(), csum_fold() and
csum_tcpudp_nofold(). These functions are copied verbatim from
include/asm-generic/checksum.h and lib/checksum.c. While it's fine to
copy and use these functions in the sample application the decorator
brings no value and can be dropped together with the include.
With this change it's trivial to compile the xdpsock sample outside the
kernel tree from xdpsock_user.c and xdpsock.h.
$ gcc -o xdpsock xdpsock_user.c -lbpf -lpthread
Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://lore.kernel.org/bpf/20210806122855.26115-2-simon.horman@corigine.com
BPF programs for reference_tracking selftest use "fail_" prefix to notify that
they are expected to fail. This is really confusing and inconvenient when
trying to grep through test_progs output to find *actually* failed tests. So
rename the prefix from "fail_" to "err_".
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210805230734.437914-1-andrii@kernel.org
Currently, this test is incorrectly printing the destination port in
place of the destination IP.
Fixes: 2767c97765 ("selftests/bpf: Implement sample tcp/tcp6 bpf_iter programs")
Signed-off-by: Jose Blanquicet <josebl@microsoft.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210805164044.527903-1-josebl@microsoft.com
Rewrite to skel and ASSERT macros as well while we are at it.
v3:
- replace -f with -A to make it work with busybox ping.
-A is available on both busybox and iputils, from the man page:
On networks with low RTT this mode is essentially equivalent to
flood mode.
v2:
- don't check result of bpf_map__fd (Yonghong Song)
- remove from .gitignore (Andrii Nakryiko)
- move ping_command into network_helpers (Andrii Nakryiko)
- remove assert() (Andrii Nakryiko)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210804205524.3748709-1-sdf@google.com
Commit ce4dade7f1 ("samples/bpf: xdp_redirect_cpu: Load a eBPF program
on cpumap") added the following option, but missed adding it to optstring:
- mprog-disable: disable loading XDP program on cpumap entries
Fix it and add the missing option character.
Fixes: ce4dade7f1 ("samples/bpf: xdp_redirect_cpu: Load a eBPF program on cpumap")
Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210731005632.13228-1-matthew.cover@stackpath.com
During recent net into net-next merge ([0]) a piece of old logic ([1]) got
reintroduced accidentally while resolving merge conflict between bpf's [2]
and bpf-next's [3]. This check was removed in bpf-next tree to allow extra
ctx_in parameter passed for XDP test runs. Reinstating the check breaks
bpf_prog_test_run_xdp logic and causes a corresponding xdp_context_test_run
selftest failure. Fix by removing the check and allow ctx_in for XDP test
runs.
[0] 5af84df962 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
[1] 947e8b595b ("bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUN")
[2] 5e21bb4e81 ("bpf, test: fix NULL pointer dereference on invalid expected_attach_type")
[3] 47316f4a30 ("bpf: Support input xdp_md context in BPF_PROG_TEST_RUN")
Fixes: 5af84df962 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
As of now, only AF_UNIX datagram socket supports sockmap. But
unix_proto is shared for all kinds of AF_UNIX sockets, so we
have to check the socket type in unix_bpf_update_proto() to
explicitly reject other types, otherwise they could be added
into sockmap, too.
Fixes: c63829182c ("af_unix: Implement ->psock_update_sk_prot()")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210731195038.8084-1-xiyou.wangcong@gmail.com
Before, the interpreter allowed up to MAX_TAIL_CALL_CNT + 1 tail calls.
Now precisely MAX_TAIL_CALL_CNT is allowed, which is in line with the
behavior of the x86 JITs.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210728164741.350370-1-johan.almbladh@anyfinetworks.com
Don't populate the array states on the stack but instead it
static const. Makes the object code smaller by 79 bytes.
Before:
text data bss dec hex filename
21309 8304 192 29805 746d drivers/net/ethernet/mellanox/mlx4/qp.o
After:
text data bss dec hex filename
21166 8368 192 29726 741e drivers/net/ethernet/mellanox/mlx4/qp.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210801153742.147304-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the array if_names on the stack but instead it
static const. Makes the object code smaller by 99 bytes.
Before:
text data bss dec hex filename
27886 10752 672 39310 998e ./drivers/net/ethernet/3com/3c509.o
After:
text data bss dec hex filename
27723 10816 672 39211 992b ./drivers/net/ethernet/3com/3c509.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152650.146572-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the array faf_bits on the stack but instead it
static const. Makes the object code smaller by 175 bytes.
Before:
text data bss dec hex filename
9645 4552 0 14197 3775 ../freescale/dpaa2/dpaa2-eth-devlink.o
After:
text data bss dec hex filename
9406 4616 0 14022 36c6 ../freescale/dpaa2/dpaa2-eth-devlink.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801152209.146359-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the array random_data on the stack but instead it
static const. Makes the object code smaller by 66 bytes.
Before:
text data bss dec hex filename
52895 10976 0 63871 f97f ../qlogic/qlcnic/qlcnic_ethtool.o
After:
text data bss dec hex filename
52701 11104 0 63805 f93d ../qlogic//qlcnic/qlcnic_ethtool.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151659.146113-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the const array name on the stack but instead it
static. Makes the object code smaller by 28 bytes. Add a missing
const to clean up a checkpatch warning.
Before:
text data bss dec hex filename
124565 31565 384 156514 26362 drivers/net/ethernet/marvell/sky2.o
After:
text data bss dec hex filename
124441 31661 384 156486 26346 drivers/net/ethernet/marvell/sky2.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801150647.145728-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the array match_all_mac on the stack but instead it
static const. Makes the object code smaller by 75 bytes.
Before:
text data bss dec hex filename
46701 8960 64 55725 d9ad ../chelsio/cxgb4/cxgb4_filter.o
After:
text data bss dec hex filename
46338 9120 192 55650 d962 ../chelsio/cxgb4/cxgb4_filter.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801151205.145924-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ioana reported a refcount warning when booting over NFS:
[ 5.042532] ------------[ cut here ]------------
[ 5.047184] refcount_t: addition on 0; use-after-free.
[ 5.052324] WARNING: CPU: 7 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0xa4/0x150
...
[ 5.167201] Call trace:
[ 5.169635] refcount_warn_saturate+0xa4/0x150
[ 5.174067] fib_create_info+0xc00/0xc90
[ 5.177982] fib_table_insert+0x8c/0x620
[ 5.181893] fib_magic.isra.0+0x110/0x11c
[ 5.185891] fib_add_ifaddr+0xb8/0x190
[ 5.189629] fib_inetaddr_event+0x8c/0x140
fib_treeref needs to be set after kzalloc. The old code had a ++ which
led to the confusion when the int was replaced by a refcount_t.
Fixes: 79976892f7 ("net: convert fib_treeref from int to refcount_t")
Signed-off-by: David Ahern <dsahern@kernel.org>
Reported-by: Ioana Ciornei <ciorneiioana@gmail.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20210802160221.27263-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate arrays on the stack but instead them static const.
Makes the object code smaller by 280 bytes.
Before:
text data bss dec hex filename
24142 4368 192 28702 701e ./drivers/net/phy/mscc/mscc_ptp.o
After:
text data bss dec hex filename
23830 4400 192 28422 6f06 ./drivers/net/phy/mscc/mscc_ptp.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801070155.139057-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the array res_ids on the stack but instead it
static const. Makes the object code smaller by 14 bytes.
Before:
text data bss dec hex filename
50833 8314 256 59403 e80b ./drivers/net/netdevsim/fib.o
After:
text data bss dec hex filename
50755 8378 256 59389 e7fd ./drivers/net/netdevsim/fib.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210801065328.138906-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Use an anonymous union with a couple of anonymous structs in order to
keep userspace unchanged:
$ pahole -C ip_msfilter net/ipv4/ip_sockglue.o
struct ip_msfilter {
union {
struct {
__be32 imsf_multiaddr_aux; /* 0 4 */
__be32 imsf_interface_aux; /* 4 4 */
__u32 imsf_fmode_aux; /* 8 4 */
__u32 imsf_numsrc_aux; /* 12 4 */
__be32 imsf_slist[1]; /* 16 4 */
}; /* 0 20 */
struct {
__be32 imsf_multiaddr; /* 0 4 */
__be32 imsf_interface; /* 4 4 */
__u32 imsf_fmode; /* 8 4 */
__u32 imsf_numsrc; /* 12 4 */
__be32 imsf_slist_flex[0]; /* 16 0 */
}; /* 0 16 */
}; /* 0 20 */
/* size: 20, cachelines: 1, members: 1 */
/* last cacheline: 20 bytes */
};
Also, refactor the code accordingly and make use of the struct_size()
and flex_array_size() helpers.
This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nci_request() receives a callback function and unsigned long data
argument "opt" which is passed to the callback. Almost all of the
nci_request() callers pass pointer to a stack variable as data argument.
Only few pass scalar value (e.g. u8).
All such callbacks do not modify passed data argument and in previous
commit they were made as const. However passing pointers via unsigned
long removes the const annotation. The callback could simply cast
unsigned long to a pointer to writeable memory.
Use "const void *" as type of this "opt" argument to solve this and
prevent modifying the pointed contents. This is also consistent with
generic pattern of passing data arguments - via "void *". In few places
which pass scalar values, use casts via "unsigned long" to suppress any
warnings.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is desirable to reduce the surface of DSA_TAG_PROTO_NONE as much as
we can, because we now have options for switches without hardware
support for DSA tagging, and the occurrence in the mt7530 driver is in
fact quite gratuitout and easy to remove. Since ds->ops->get_tag_protocol()
is only called for CPU ports, the checks for a CPU port in
mtk_get_tag_protocol() are redundant and can be removed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham says:
====================
cn10k: DWRR MTU and weights configuration
On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.
On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum.
This patch series addresses this HW change on CN10K silicons,
both admin function and PF/VF drivers are modified.
Also added support to program DWRR MTU via devlink params.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Program SQ, MDQ, TL4 to TL2 transmit scheduler queues' DWRR
weight based on DWRR MTU programmed at NIX_AF_DWRR_RPM_MTU.
The DWRR MTU from admin function is retrieved via mbox.
On OcteaonTx2 silicon, admin function driver responds with DWRR
MTU as '1'. This helps to avoid silicon specific transmit
scheduler DWRR quantum/weight configuration logic.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On OcteonTx2 DWRR quantum is directly configured into each of
the transmit scheduler queues. And PF/VF drivers were free to
config any value upto 2^24.
On CN10K, HW is modified, the quantum configuration at scheduler
queues is in terms of weight. And SW needs to setup a base DWRR MTU
at NIX_AF_DWRR_RPM_MTU / NIX_AF_DWRR_SDP_MTU. HW will do
'DWRR MTU * weight' to get the quantum. For LBK traffic, value
programmed into NIX_AF_DWRR_RPM_MTU register is considered as
DWRR MTU.
This patch programs a default DWRR MTU of 8192 into HW and also
provides a way to change this via devlink params.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removed the 'raw gso min size - 1' test which
always fails now:
./in_netns.sh ./psock_snd -v -c -g -l "${mss}"
raw gso min size - 1 (expected to fail)
tx: 1524
rx: 1472
OK
After commit 7c6d2ecbda ("net: be more gentle about silly
gso requests coming from user"), we relaxed the min gso_size
check in virtio_net_hdr_to_skb().
So when a packet which is smaller then the gso_size,
GSO for this packet will not be set, the packet will be
send/recv successfully.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some time ago, I reported a calltrace issue
"did not find a suitable aggregator", please see[1].
After a period of analysis and reproduction, I find
that this problem is caused by concurrency.
Before the problem occurs, the bond structure is like follows:
bond0 - slaver0(eth0) - agg0.lag_ports -> port0 - port1
\
port0
\
slaver1(eth1) - agg1.lag_ports -> NULL
\
port1
If we run 'ifenslave bond0 -d eth1', the process is like below:
excuting __bond_release_one()
|
bond_upper_dev_unlink()[step1]
| | |
| | bond_3ad_lacpdu_recv()
| | ->bond_3ad_rx_indication()
| | spin_lock_bh()
| | ->ad_rx_machine()
| | ->__record_pdu()[step2]
| | spin_unlock_bh()
| | |
| bond_3ad_state_machine_handler()
| spin_lock_bh()
| ->ad_port_selection_logic()
| ->try to find free aggregator[step3]
| ->try to find suitable aggregator[step4]
| ->did not find a suitable aggregator[step5]
| spin_unlock_bh()
| |
| |
bond_3ad_unbind_slave() |
spin_lock_bh()
spin_unlock_bh()
step1: already removed slaver1(eth1) from list, but port1 remains
step2: receive a lacpdu and update port0
step3: port0 will be removed from agg0.lag_ports. The struct is
"agg0.lag_ports -> port1" now, and agg0 is not free. At the
same time, slaver1/agg1 has been removed from the list by step1.
So we can't find a free aggregator now.
step4: can't find suitable aggregator because of step2
step5: cause a calltrace since port->aggregator is NULL
To solve this concurrency problem, put bond_upper_dev_unlink()
after bond_3ad_unbind_slave(). In this way, we can invalid the port
first and skip this port in bond_3ad_state_machine_handler(). This
eliminates the situation that the slaver has been removed from the
list but the port is still valid.
[1]https://lore.kernel.org/netdev/10374.1611947473@famine/
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().
This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:
TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.
TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.
TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
means we are replacing an existing action.
TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.
The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.
I have tested this patch with tdc and I do not see any
failure related to this patch.
Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In three places, the driver calls of_get_property and reads the property
length although the length is not used. Update the calls to not request
the length.
Signed-off-by: Martin Kaiser <martin@kaiser.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrii Nakryiko says:
====================
bpf-next 2021-07-30
We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
The main changes are:
1) BTF-guided binary data dumping libbpf API, from Alan.
2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
7) BPF map pinning improvements in libbpf, from Martynas.
8) Improved module BTF support in libbpf and bpftool, from Quentin.
9) Bpftool cleanups and documentation improvements, from Quentin.
10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
11) Increased maximum cgroup storage size, from Stanislav.
12) Small fixes and improvements to BPF tests and samples, from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================
Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
and netfilter trees.
Current release - regressions:
- mac80211: fix starting aggregation sessions on mesh interfaces
Current release - new code bugs:
- sctp: send pmtu probe only if packet loss in Search Complete state
- bnxt_en: add missing periodic PHC overflow check
- devlink: fix phys_port_name of virtual port and merge error
- hns3: change the method of obtaining default ptp cycle
- can: mcba_usb_start(): add missing urb->transfer_dma initialization
Previous releases - regressions:
- set true network header for ECN decapsulation
- mlx5e: RX, avoid possible data corruption w/ relaxed ordering and LRO
- phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811 PHY
- sctp: fix return value check in __sctp_rcv_asconf_lookup
Previous releases - always broken:
- bpf:
- more spectre corner case fixes, introduce a BPF nospec
instruction for mitigating Spectre v4
- fix OOB read when printing XDP link fdinfo
- sockmap: fix cleanup related races
- mac80211: fix enabling 4-address mode on a sta vif after assoc
- can:
- raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
- j1939: j1939_session_deactivate(): clarify lifetime of
session object, avoid UAF
- fix number of identical memory leaks in USB drivers
- tipc:
- do not blindly write skb_shinfo frags when doing decryption
- fix sleeping in tipc accept routine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEEWm8ACgkQMUZtbf5S
Irv84A//V/nn9VRdpDpmodwBWVEc9SA00M/nmziRBLwRyG+fRMtnePY4Ha40TPbh
LL6orth08hZKOjVmMc6Ea4EjZbV5E3iAKtAnaX6wi1HpEXVxKtFYnWxu9ydwTEd9
An1fltDtWYkNi3kiq7il+Tp1/yZAQ+NYv5zQZCWJ47kkN3jkjULdAEBqODA2A6Ul
0PQgS1rKzXukE19PlXDuaNuEekhTiEfaTwzHjdBJZkj1toGJGfHsvdQ/YJjixzB9
44SjE4PfxIaMWP0BVaD6hwzaVQhaZETXhZZufdIDdQd7sDbmd6CPODX6mXfLEq4u
JaWylgobsK+5ScHE6siVI+ZlW7stq9l1Ynm10ADiwsZVzKEoP745484aEFOLO6Z+
Ln/IqDQCP/yJQmnl2i0+TfqVDh6BKYoIfUUK/+nzHw4Otycy0m3kj4P+74aYfjOv
Q+cUgbXUemcrpq6wGUK+zK0NyNHVILvdPDnHPMMypwqPk18y5ZmFvaJAVUPSavD9
N7t9LoLyGwK3i/Ir4l+JJZ1KgAv1+TbmyNBWvY1Yk/r/vHU3nBPIv26s7YarNAwD
094vJEJ0+mqO4h+Xj1Nc7HEBFi46JfpN2L8uYoM7gpwziIRMdmpXVLmpEk43WmFi
UMwWJWqabPEXaozC2UFcFLSk+jS7DiD+G5eG+Fd5HecmKzd7RI0=
=sKPI
-----END PGP SIGNATURE-----
Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
(mac80211) and netfilter trees.
Current release - regressions:
- mac80211: fix starting aggregation sessions on mesh interfaces
Current release - new code bugs:
- sctp: send pmtu probe only if packet loss in Search Complete state
- bnxt_en: add missing periodic PHC overflow check
- devlink: fix phys_port_name of virtual port and merge error
- hns3: change the method of obtaining default ptp cycle
- can: mcba_usb_start(): add missing urb->transfer_dma initialization
Previous releases - regressions:
- set true network header for ECN decapsulation
- mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
LRO
- phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
PHY
- sctp: fix return value check in __sctp_rcv_asconf_lookup
Previous releases - always broken:
- bpf:
- more spectre corner case fixes, introduce a BPF nospec
instruction for mitigating Spectre v4
- fix OOB read when printing XDP link fdinfo
- sockmap: fix cleanup related races
- mac80211: fix enabling 4-address mode on a sta vif after assoc
- can:
- raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
- j1939: j1939_session_deactivate(): clarify lifetime of session
object, avoid UAF
- fix number of identical memory leaks in USB drivers
- tipc:
- do not blindly write skb_shinfo frags when doing decryption
- fix sleeping in tipc accept routine"
* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
gve: Update MAINTAINERS list
can: esd_usb2: fix memory leak
can: ems_usb: fix memory leak
can: usb_8dev: fix memory leak
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: hi311x: fix a signedness bug in hi3110_cmd()
MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
bpf: Fix leakage due to insufficient speculative store bypass mitigation
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
sis900: Fix missing pci_disable_device() in probe and remove
net: let flow have same hash in two directions
nfc: nfcsim: fix use after free during module unload
tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
sctp: fix return value check in __sctp_rcv_asconf_lookup
nfc: s3fwrn5: fix undefined parameter values in dev_err()
net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
net/mlx5: Unload device upon firmware fatal error
net/mlx5e: Fix page allocation failure for ptp-RQ over SF
net/mlx5e: Fix page allocation failure for trap-RQ over SF
...
- Revert recent change of the ACPI IRQ resources handling that
attempted to improve the ACPI IRQ override selection logic, but
introduced serious regressions on some systems (Hui Wang).
- Fix up quirks for AMD platforms in the suspend-to-idle support
code so as to take upcoming systems using uPEP HID AMDI007 into
account as appropriate (Mario Limonciello).
- Fix the code retrieving DPTF attributes of the PCH FIVR so that
it agrees on the return data type with the ACPI control method
evaluated for this purpose (Srinivas Pandruvada).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmEESVASHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxAg4P/3XEhqudZI7+5VsGgjdVSuYIQZEVBiIt
6M1WOIVwbL+pybeadYVTIzjsBIWWyem48mvUf9liqpzNfgTdyVdbANyJNct9NsgN
fETI/ethoBE0+VA8m5Z59ze3vwLWHII++GrL2J6/XQm+VV1mul2FsZ9GJj8v+8zf
rD/OatQZMkdLQ5Z7E3OfeNRETmyuxd95wI3xmNcUxtMWkpWq41tRCoXHRPGuWxGF
xJKRHDtN7MqXI6WPvdKLMZ2XXbbbmwr4fw5/r5schkP8dbtzMLMPhd7blZlA81jF
no7jNQ8EPs5IIgpMkxNo1wVnYK0ALDWrAKifODsQe1WJbRThz9SRAssYD7WQkczE
zoE6FcUt2rrKj91P0cnOUWJ+PI8WTa4RStjva1zxliwgv7pDn5SuedAdPv0P/9Zz
XO74NrnrF8P+H/rWMNX3/kVzzabw58gzr0o/2a17sFyk+dVAb3vsjQRS+MWy/GLA
OsiAqqRe3jC2OtyeQ053FLxacSqItBrfySPB9fGO5rpi5KOG/8ODNf3Y9Z+aWeln
LNi7/SU4ZUbklmr8BbXRAdfCnZBrmr9+ddeP4Qg4cBtoJQUsjQHSQzXEuFQWPFNL
L+oHkPLNw0Yqej5pJa5eoOKEH0lm0aBDivNV3zJ/0PqD2zFtNB6LQWIftOYARaz0
CDfY0XwUdq58
=IGYg
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert a recent IRQ resources handling modification that turned
out to be problematic, fix suspend-to-idle handling on AMD platforms
to take upcoming systems into account properly and fix the retrieval
of the DPTF attributes of the PCH FIVR.
Specifics:
- Revert recent change of the ACPI IRQ resources handling that
attempted to improve the ACPI IRQ override selection logic, but
introduced serious regressions on some systems (Hui Wang).
- Fix up quirks for AMD platforms in the suspend-to-idle support code
so as to take upcoming systems using uPEP HID AMDI007 into account
as appropriate (Mario Limonciello).
- Fix the code retrieving DPTF attributes of the PCH FIVR so that it
agrees on the return data type with the ACPI control method
evaluated for this purpose (Srinivas Pandruvada)"
* tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: DPTF: Fix reading of attributes
Revert "ACPI: resources: Add checks for ACPI IRQ override"
ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
Since commit 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit 1b6b26ae70 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit f467a6a664 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quentin Monnet says:
====================
To work with the different program types, map types, attach types etc.
supported by eBPF, bpftool needs occasional updates to learn about the new
features supported by the kernel. When such types translate into new
keyword for the command line, updates are expected in several locations:
typically, the help message displayed from bpftool itself, the manual page,
and the bash completion file should be updated. The options used by the
different commands for bpftool should also remain synchronised at those
locations.
Several omissions have occurred in the past, and a number of types are
still missing today. This set is an attempt to improve the situation. It
brings up-to-date the lists of types or options in bpftool, and also adds a
Python script to the BPF selftests to automatically check that most of
these lists remain synchronised.
v2:
- Reformat some lines in the bash completion file.
- Do not reformat attach types, to preserve git-blame history.
- Do not call Python script from tools/testing/selftests/bpf/Makefile.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Profiling programs with bpftool was extended some time ago to support
two new metrics, namely itlb_misses and dtlb_misses (misses for the
instruction/data translation lookaside buffer). Update the manual page
and bash completion accordingly.
Fixes: 450d060e8f ("bpftool: Add {i,d}tlb_misses support for bpftool profile")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-8-quentin@isovalent.com
The -L|--use-loader option for using loader programs when loading, or
when generating a skeleton, did not have any documentation or bash
completion. Same thing goes for -B|--base-btf, used to pass a path to a
base BTF object for split BTF such as BTF for kernel modules.
This patch documents and adds bash completion for those options.
Fixes: 75fa177769 ("tools/bpftool: Add bpftool support for split BTF")
Fixes: d510296d33 ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-7-quentin@isovalent.com
Update the script responsible for checking that the different types used
at various places in bpftool are synchronised, and extend it to check
the consistency of options between the help messages in the source code
and the manual pages.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-6-quentin@isovalent.com
All bpftool commands support the options for JSON output and debug from
libbpf. In addition, some commands support additional options
corresponding to specific use cases.
The list of options described in the man pages for the different
commands are not always accurate. The messages for interactive help are
mostly limited to HELP_SPEC_OPTIONS, and are even less representative of
the actual set of options supported for the commands.
Let's update the lists:
- HELP_SPEC_OPTIONS is modified to contain the "default" options (JSON
and debug), and to be extensible (no ending curly bracket).
- All commands use HELP_SPEC_OPTIONS in their help message, and then
complete the list with their specific options.
- The lists of options in the man pages are updated.
- The formatting of the list for bpftool.rst is adjusted to match
formatting for the other man pages. This is for consistency, and also
because it will be helpful in a future patch to automatically check
that the files are synchronised.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-5-quentin@isovalent.com
Update bpftool's list of attach type names to tell it about the latest
attach types, or the "ringbuf" map. Also update the documentation, help
messages, and bash completion when relevant.
These missing items were reported by the newly added Python script used
to help maintain consistency in bpftool.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-4-quentin@isovalent.com
Whenever the eBPF subsystem gains new elements, such as new program or
map types, it is necessary to update bpftool if we want it able to
handle the new items.
In addition to the main arrays containing the names of these elements in
the source code, there are also multiple locations to update:
- The help message in the do_help() functions in bpftool's source code.
- The RST documentation files.
- The bash completion file.
This has led to omissions multiple times in the past. This patch
attempts to address this issue by adding consistency checks for all
these different locations. It also verifies that the bpf_prog_type,
bpf_map_type and bpf_attach_type enums from the UAPI BPF header have all
their members present in bpftool.
The script requires no argument to run, it reads and parses the
different files to check, and prints the mismatches, if any. It
currently reports a number of missing elements, which will be fixed in a
later patch:
$ ./test_bpftool_synctypes.py
Comparing [...]/linux/tools/bpf/bpftool/map.c (map_type_name) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_MAP_CREATE_TYPES): {'ringbuf'}
Comparing BPF header (enum bpf_attach_type) and [...]/linux/tools/bpf/bpftool/common.c (attach_type_name): {'BPF_TRACE_ITER', 'BPF_XDP_DEVMAP', 'BPF_XDP', 'BPF_SK_REUSEPORT_SELECT', 'BPF_XDP_CPUMAP', 'BPF_SK_REUSEPORT_SELECT_OR_MIGRATE'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/prog.c (do_help() ATTACH_TYPE): {'skb_verdict'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/Documentation/bpftool-prog.rst (ATTACH_TYPE): {'skb_verdict'}
Comparing [...]/linux/tools/bpf/bpftool/prog.c (attach_type_strings) and [...]/linux/tools/bpf/bpftool/bash-completion/bpftool (BPFTOOL_PROG_ATTACH_TYPES): {'skb_verdict'}
Note that the script does NOT check for consistency between the list of
program types that bpftool claims it accepts and the actual list of
keywords that can be used. This is because bpftool does not "see" them,
they are ELF section names parsed by libbpf. It is not hard to parse the
section_defs[] array in libbpf, but some section names are associated
with program types that bpftool cannot load at the moment. For example,
some programs require a BTF target and an attach target that bpftool
cannot handle. The script may be extended to parse the array and check
only relevant values in the future.
The script is not added to the selftests' Makefile, because doing so
would require all patches with BPF UAPI change to also update bpftool.
Instead it is to be added to the CI.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-3-quentin@isovalent.com
Bash completion for bpftool gets two minor improvements in this patch.
Move the detection of attach types for "bpftool cgroup attach" outside
of the "case/esac" bloc, where we cannot reuse our variable holding the
list of supported attach types as a pattern list. After the change, we
have only one list of cgroup attach types to update when new types are
added, instead of the former two lists.
Also rename the variables holding lists of names for program types, map
types, and attach types, to make them more unique. This can make it
slightly easier to point people to the relevant variables to update, but
the main objective here is to help run a script to check that bash
completion is up-to-date with bpftool's source code.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210730215435.7095-2-quentin@isovalent.com
Leon Romanovsky says:
====================
Clean devlink net namespace operations
This short series continues my work on devlink core code to make devlink
reload less prone to errors and harden it from API abuse.
Despite first patch being a clear fix, I would ask you to apply it to
net-next anyway, because the fixed patch is anyway old and it will
help us to eliminate merge conflicts that will arise for following
patches or even for the second one.
====================
Link: https://lore.kernel.org/r/cover.1627578998.git.leonro@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need in extra call indirection and check from impossible
flow where someone tries to set namespace without prior call
to devlink_alloc().
Instead of this extra logic and additional EXPORT_SYMBOL, use specialized
devlink allocation function that receives net namespace as an argument.
Such specialized API allows clear view when devlink initialized in wrong
net namespace and/or kernel users don't try to change devlink namespace
under the hood.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Eric noticed, __unix_dgram_recvmsg() may acquire u->iolock
too, so we have to release it before calling this function.
Fixes: 9825d866ce ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>