korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-26 21:54:11 +08:00

Author	SHA1	Message	Date
Martin KaFai Lau	c9ae8c966f	Merge branch 'fixes for concurrent htab updates' Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the issues found during investigating the syzkaller problem reported in [0]. It seems that the concurrent updates to the same hash-table bucket may fail as shown in patch 1. Patch 1 uses preempt_disable() to fix the problem for htab_use_raw_lock() case. For !htab_use_raw_lock() case, the problem is left to "BPF specific memory allocator" patchset [1] in which !htab_use_raw_lock() will be removed. Patch 2 fixes the out-of-bound memory read problem reported in [0]. The problem has the root cause as patch 1 and it is fixed by handling -EBUSY from htab_lock_bucket() correctly. Patch 3 add two cases for hash-table update: one for the reentrancy of bpf_map_update_elem(), and another one for concurrent updates of the same hash-table bucket. Comments are always welcome. Regards, Tao [0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/ Change Log: v4: * rebased on bpf-next * add htab_update to DENYLIST.s390x v3: https://lore.kernel.org/bpf/20220829023709.1958204-1-houtao@huaweicloud.com/ * patch 1: update commit message and add Fixes tag * patch 2: add Fixes tag * patch 3: elaborate the description of test cases v2: https://lore.kernel.org/bpf/bd60ef93-1c6a-2db2-557d-b09b92ad22bd@huaweicloud.com/ * Note the fix is for CONFIG_PREEMPT case in commit message and add Reviewed-by tag for patch 1 * Drop patch "bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case" v1: https://lore.kernel.org/bpf/20220821033223.2598791-1-houtao@huaweicloud.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
Hou Tao	1c636b6277	selftests/bpf: Add test cases for htab update One test demonstrates that a reentrant hash map update on the same bucket should fail, and another one shows that concurrent updates of the same hash map bucket should succeed and not fail due to the reentrancy checking for the bucket lock. There is no trampoline support on s390x, so move htab_update to denylist. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-4-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
Hou Tao	66a7a92e4d	bpf: Propagate error from htab_lock_bucket() to userspace In __htab_map_lookup_and_delete_batch() if htab_lock_bucket() returns -EBUSY, it will go to next bucket. Going to next bucket may not only skip the elements in current bucket silently, but also incur out-of-bound memory access or expose kernel memory to userspace if current bucket_cnt is greater than bucket_size or zero. Fixing it by stopping batch operation and returning -EBUSY when htab_lock_bucket() fails, and the application can retry or skip the busy batch as needed. Fixes: `20b6cc34ea` ("bpf: Avoid hashtab deadlock with map_locked") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
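The message above leaves it to the application to retry or skip a batch that fails with -EBUSY. A minimal sketch of the retry side, in Python for brevity: `batch_op` and `make_flaky_op` are hypothetical stand-ins for one lookup-and-delete batch call, not real libbpf bindings, and the retry policy is an assumption.

```python
import errno
import time


def retry_on_ebusy(batch_op, attempts=3, delay=0.0):
    """Call batch_op() and retry while it fails with EBUSY.

    batch_op is a hypothetical callable standing in for one
    lookup-and-delete batch call; it is not a real libbpf binding.
    """
    for attempt in range(attempts):
        try:
            return batch_op()
        except OSError as err:
            # Only a busy bucket is worth retrying; re-raise anything else,
            # and give up after the last attempt.
            if err.errno != errno.EBUSY or attempt == attempts - 1:
                raise
            time.sleep(delay)


def make_flaky_op(busy_times):
    """Build a fake batch op that reports EBUSY `busy_times` times first."""
    state = {"left": busy_times}

    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError(errno.EBUSY, "bucket busy")
        return ["elem0", "elem1"]

    return op
```

A caller that prefers to skip a busy batch would catch the final `OSError` and move on instead of propagating it.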
Hou Tao	2775da2162	bpf: Disable preemption when increasing per-cpu map_locked Per-cpu htab->map_locked is used to prohibit the concurrent accesses from both NMI and non-NMI contexts. But since commit `74d862b682` ("sched: Make migrate_disable/enable() independent of RT"), migrate_disable() is also preemptible under CONFIG_PREEMPT case, so now map_locked also disallows concurrent updates from normal contexts (e.g. userspace processes) unexpectedly as shown below: process A: htab_map_update_elem() htab_lock_bucket() migrate_disable() /* return 1 */ __this_cpu_inc_return() /* preempted by B */ process B: htab_map_update_elem() /* the same bucket as A */ htab_lock_bucket() migrate_disable() /* return 2, so lock fails */ __this_cpu_inc_return() return -EBUSY A fix that seems feasible is using in_nmi() in htab_lock_bucket() and only checking the value of map_locked for nmi context. But it will re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered through non-tracing program (e.g. fentry program). One cannot use preempt_disable() to fix this issue as htab_use_raw_lock being false causes the bucket lock to be a spin lock which can sleep and does not work with preempt_disable(). Therefore, use migrate_disable() when using the spinlock instead of preempt_disable() and defer fixing concurrent updates to when the kernel has its own BPF memory allocator. Fixes: `74d862b682` ("sched: Make migrate_disable/enable() independent of RT") Reviewed-by: Hao Luo <haoluo@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2022-08-31 14:10:01 -07:00
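The interleaving described in the message above can be replayed with a toy model. This is only a Python sketch of the counter check, assuming a single CPU and a single bucket; `PerCpuMapLocked` and `run` are hypothetical names, and nothing here is kernel code.

```python
import errno


class PerCpuMapLocked:
    """Toy model of one CPU's map_locked counter for a single bucket."""

    def __init__(self):
        self.count = 0

    def lock_bucket(self):
        # Mirrors the check described above: the increment must return 1,
        # otherwise someone on this CPU already holds the bucket.
        self.count += 1
        if self.count != 1:
            self.count -= 1
            return -errno.EBUSY
        return 0

    def unlock_bucket(self):
        self.count -= 1


def run(schedule):
    """Replay (task, action) steps on one CPU and collect lock results."""
    cpu = PerCpuMapLocked()
    results = {}
    for task, action in schedule:
        if action == "lock":
            results[task] = cpu.lock_bucket()
        else:
            cpu.unlock_bucket()
    return results


# A is preempted by B between its lock and unlock (preemptible section).
preempted = run([("A", "lock"), ("B", "lock"), ("A", "unlock")])

# With preemption disabled, B cannot run until A has unlocked.
serialized = run([("A", "lock"), ("A", "unlock"), ("B", "lock"), ("B", "unlock")])
```

In the first schedule B sees a counter value of 2 and gets -EBUSY even though no reentrancy happened; in the second both updates succeed, which is the behaviour the fix aims for in the htab_use_raw_lock() case.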
Martin KaFai Lau	197072945a	selftest/bpf: Ensure no module loading in bpf_setsockopt(TCP_CONGESTION) This patch adds a test to ensure bpf_setsockopt(TCP_CONGESTION, "not_exist") will not trigger the kernel module autoload. Before the fix: [ 40.535829] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 [...] [ 40.552134] tcp_ca_find_autoload.constprop.0+0xcb/0x200 [ 40.552689] tcp_set_congestion_control+0x99/0x7b0 [ 40.553203] do_tcp_setsockopt+0x3ed/0x2240 [...] [ 40.556041] __bpf_setsockopt+0x124/0x640 Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231953.792412-1-martin.lau@linux.dev	2022-08-31 22:22:29 +02:00
Martin KaFai Lau	84e5a0f208	bpf, net: Avoid loading module when calling bpf_setsockopt(TCP_CONGESTION) When bpf prog changes tcp-cc by calling bpf_setsockopt(TCP_CONGESTION), it should not try to load module which may be a blocking operation. This detail was correct in the v1 [0] but missed by mistake in the later revision in commit `cb388e7ee3` ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()"). This patch fixes it by checking the has_current_bpf_ctx(). [0] https://lore.kernel.org/bpf/20220727060921.2373314-1-kafai@fb.com/ Fixes: `cb388e7ee3` ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()") Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231946.791504-1-martin.lau@linux.dev	2022-08-31 22:21:45 +02:00
James Hilliard	14e5ce7994	libbpf: Add GCC support for bpf_tail_call_static The bpf_tail_call_static function is currently not defined unless using clang >= 8. To support bpf_tail_call_static on GCC we can check if __clang__ is not defined to enable bpf_tail_call_static. We need to use GCC assembly syntax when the compiler does not define __clang__ as LLVM inline assembly is not fully compatible with GCC. Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220829210546.755377-1-james.hilliard1@gmail.com	2022-08-31 20:54:23 +02:00
Hao Luo	6f95de6d71	bpftool: Add support for querying cgroup_iter link Support dumping info of a cgroup_iter link. This includes showing the cgroup's id and the order for walking the cgroup hierarchy. Example output is as follows: > bpftool link show 1: iter prog 2 target_name bpf_map 2: iter prog 3 target_name bpf_prog 3: iter prog 12 target_name cgroup cgroup_id 72 order self_only > bpftool -p link show [{ "id": 1, "type": "iter", "prog_id": 2, "target_name": "bpf_map" },{ "id": 2, "type": "iter", "prog_id": 3, "target_name": "bpf_prog" },{ "id": 3, "type": "iter", "prog_id": 12, "target_name": "cgroup", "cgroup_id": 72, "order": "self_only" } ] Signed-off-by: Hao Luo <haoluo@google.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220829231828.1016835-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>	2022-08-30 11:02:03 -07:00
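The JSON form shown in the message above is convenient for scripting. A small sketch that picks out cgroup_iter links; the JSON text is copied from the example output, while the helper name `cgroup_iter_links` is made up for illustration.

```python
import json

# JSON copied from the `bpftool -p link show` example in the commit message.
OUTPUT = """
[{"id": 1, "type": "iter", "prog_id": 2, "target_name": "bpf_map"},
 {"id": 2, "type": "iter", "prog_id": 3, "target_name": "bpf_prog"},
 {"id": 3, "type": "iter", "prog_id": 12, "target_name": "cgroup",
  "cgroup_id": 72, "order": "self_only"}]
"""


def cgroup_iter_links(text):
    """Return (link id, cgroup id, order) for every cgroup_iter link."""
    return [
        (link["id"], link["cgroup_id"], link["order"])
        for link in json.loads(text)
        if link.get("type") == "iter" and link.get("target_name") == "cgroup"
    ]
```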
James Hilliard	2eb680401d	selftests/bpf: Fix connect4_prog tcp/socket header type conflict There is a potential for us to hit a type conflict when including netinet/tcp.h and sys/socket.h, we can replace both of these includes with linux/tcp.h and bpf_tcp_helpers.h to avoid this conflict. Fixes errors like the below when compiling with gcc BPF backend: In file included from /usr/include/netinet/tcp.h:91, from progs/connect4_prog.c:11: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char' 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/connect4_prog.c:10: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int' 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220829154710.3870139-1-james.hilliard1@gmail.com	2022-08-29 22:19:48 +02:00
James Hilliard	3721359d39	selftests/bpf: Fix bind{4,6} tcp/socket header type conflict There is a potential for us to hit a type conflict when including netinet/tcp.h with sys/socket.h, we can remove these as they are not actually needed. Fixes errors like the below when compiling with gcc BPF backend: In file included from /usr/include/netinet/tcp.h:91, from progs/bind4_prog.c:10: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char' 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/bind4_prog.c:9: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int' 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/bind4_prog.o] Error 1 Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220826052925.980431-1-james.hilliard1@gmail.com	2022-08-29 17:00:41 +02:00
Tiezhu Yang	bbcf0f55e5	bpf, mips: No need to use min() to get MAX_TAIL_CALL_CNT MAX_TAIL_CALL_CNT is 33, so min(MAX_TAIL_CALL_CNT, 0xffff) is always MAX_TAIL_CALL_CNT, it is better to use MAX_TAIL_CALL_CNT directly. At the same time, add BUILD_BUG_ON(MAX_TAIL_CALL_CNT > 0xffff) with a comment on why the assertion is there. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Suggested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1661742309-2320-1-git-send-email-yangtiezhu@loongson.cn	2022-08-29 15:38:14 +02:00
Quentin Monnet	aa75622c3b	bpf: Fix a few typos in BPF helpers documentation Address a few typos in the documentation for the BPF helper functions. They were reported by Jakub [0], who ran spell checkers on the generated man page [1]. [0] https://lore.kernel.org/linux-man/d22dcd47-023c-8f52-d369-7b5308e6c842@gmail.com/T/#mb02e7d4b7fb61d98fa914c77b581184e9a9537af [1] https://lore.kernel.org/linux-man/eb6a1e41-c48e-ac45-5154-ac57a2c76108@gmail.com/T/#m4a8d1b003616928013ffcd1450437309ab652f9f v3: Do not copy unrelated (and breaking) elements to tools/ header v2: Turn a ',' into a ';' Reported-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220825220806.107143-1-quentin@isovalent.com	2022-08-26 22:19:31 -07:00
James Hilliard	b05d64efbb	selftests/bpf: Declare subprog_noise as static in tailcall_bpf2bpf4 Due to bpf_map_lookup_elem being declared static we need to also declare subprog_noise as static. Fixes the following error: progs/tailcall_bpf2bpf4.c:26:9: error: 'bpf_map_lookup_elem' is static but used in inline function 'subprog_noise' which is not static [-Werror] 26 | bpf_map_lookup_elem(&nop_table, &key); | ^~~~~~~~~~~~~~~~~~~ Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20220826035141.737919-1-james.hilliard1@gmail.com	2022-08-26 22:07:01 -07:00
James Hilliard	ab9ac19c4d	selftests/bpf: fix type conflict in test_tc_dtime The sys/socket.h header isn't required to build test_tc_dtime and may cause a type conflict. Fixes the following error: In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/test_tc_dtime.c:18: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: error: conflicting types for 'int8_t'; have '__int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ In file included from progs/test_tc_dtime.c:5: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'char'} 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: error: conflicting types for 'int64_t'; have '__int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long int'} 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/test_tc_dtime.o] Error 1 Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Link: https://lore.kernel.org/r/20220826050703.869571-1-james.hilliard1@gmail.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>	2022-08-26 14:55:38 -07:00
Benjamin Tissoires	343949e107	libbpf: add map_get_fd_by_id and map_delete_elem in light skeleton This allows to have a better control over maps from the kernel when preloading eBPF programs. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220824134055.1328882-8-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 18:52:29 -07:00
Benjamin Tissoires	b88df69796	bpf: prepare for more bpf syscall to be used from kernel and user space. Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG. Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able to access the bpf pointer either from the userspace or the kernel. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 18:52:05 -07:00
Hao Luo	d4ffb6f39f	bpf: Add CGROUP prefix to cgroup_iter_order bpf_cgroup_iter_order is globally visible but the entries do not have CGROUP prefix. As requested by Andrii, put a CGROUP in the names in bpf_cgroup_iter_order. This patch fixes two previous commits: one introduced the API and the other uses the API in bpf selftest (that is, the selftest cgroup_hierarchical_stats). I tested this patch via the following command: test_progs -t cgroup,iter,btf_dump Fixes: `d4ccaf58a8` ("bpf: Introduce cgroup iter") Fixes: `88886309d2` ("selftests/bpf: add a selftest for cgroup hierarchical stats collection") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>	2022-08-25 16:26:37 -07:00
Eyal Birger	0a0d55ef3e	bpf/scripts: Assert helper enum value is aligned with comment order The helper value is ABI as defined by enum bpf_func_id. As bpf_helper_defs.h is used for the userspace part, it must be consistent with this enum. Before this change the comments order was used by the bpf_doc script in order to set the helper values defined in the helpers file. When adding new helpers it is very puzzling when the userspace application breaks in weird places if the comment is inserted instead of appended - because the generated helper ABI is incorrect and shifted. This commit sets the helper value to the enum value. In addition it is currently the practice to have the comments appended and kept in the same order as the enum. As such, add an assertion validating the comment order is consistent with enum value. In case a different comments ordering is desired, this assertion can be lifted. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220824181043.1601429-1-eyal.birger@gmail.com	2022-08-25 11:49:14 -07:00
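The idea in the message above, taking each helper's value from the enum and only asserting that the documentation order agrees, can be sketched as follows. This is not the actual bpf_doc script: `ENUM_VALUES` is a hand-written miniature of the enum and `assign_helper_ids` is a hypothetical function name.

```python
# Hypothetical miniature of the enum and of the documented helper order;
# the real script parses these out of the uapi header instead.
ENUM_VALUES = {"unspec": 0, "map_lookup_elem": 1, "map_update_elem": 2,
               "map_delete_elem": 3}


def assign_helper_ids(documented_order, enum_values):
    """Map each documented helper to its enum value, checking the order."""
    ids = {}
    # Position 0 is the placeholder entry, so documented helpers start at 1.
    for position, name in enumerate(documented_order, start=1):
        value = enum_values[name]
        # The value always comes from the enum; the assertion only guards
        # the convention that comments stay in enum order.
        assert value == position, (
            f"helper {name} is documented at position {position} "
            f"but has enum value {value}"
        )
        ids[name] = value
    return ids
```

Dropping the assertion would still yield correct values, since they are read from the enum rather than derived from position; that matches the note that the assertion can be lifted if a different comment ordering is wanted.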
Lam Thai	7184aef9c0	bpftool: Fix a wrong type cast in btf_dumper_int When `data` points to a boolean value, casting it to `int *` is problematic and could lead to a wrong value being passed to `jsonw_bool`. Change the cast to `bool *` instead. Fixes: `b12d6ec097` ("bpf: btf: add btf print functionality") Signed-off-by: Lam Thai <lamthai@arista.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220824225859.9038-1-lamthai@arista.com	2022-08-25 11:43:08 -07:00
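Why the width of the read matters can be shown without any C. The sketch below uses Python's `struct` module to read the same bytes once as a 4-byte integer and once as a 1-byte boolean; the byte values are invented for illustration and a little-endian layout is assumed.

```python
import struct

# One byte holding the boolean `false`, followed by three unrelated bytes
# that happen to sit next to it in memory (values chosen for illustration).
memory = bytes([0x00, 0xAA, 0xBB, 0xCC])

# Reading through an int-sized view picks up the neighbouring bytes too.
as_int = struct.unpack_from("<i", memory, 0)[0]

# Reading through a bool-sized view looks at the first byte only.
as_bool = struct.unpack_from("<?", memory, 0)[0]
```

Here `as_int` is nonzero and would be reported as true, while `as_bool` correctly yields `False`.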
Alexei Starovoitov	eef3c3d337	Merge branch 'bpf: rstat: cgroup hierarchical' Hao Luo says: ==================== This patch series allows for using bpf to collect hierarchical cgroup stats efficiently by integrating with the rstat framework. The rstat framework provides an efficient way to collect cgroup stats percpu and propagate them through the cgroup hierarchy. The stats are exposed to userspace in textual form by reading files in bpffs, similar to cgroupfs stats by using a cgroup_iter program. cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes: - walking a cgroup's descendants in pre-order. - walking a cgroup's descendants in post-order. - walking a cgroup's ancestors. - process only a single object. When attaching cgroup_iter, one needs to set a cgroup to the iter_link created from attaching. This cgroup can be passed either as a file descriptor or a cgroup id. That cgroup serves as the starting point of the walk. One can also terminate the walk early by returning 1 from the iter program. Note that because walking cgroup hierarchy holds cgroup_mutex, the iter program is called with cgroup_mutex held. Background on rstat for stats collection (I am using a subscriber analogy that is not commonly used) The rstat framework maintains a tree of cgroups that have updates and which cpus have updates. A subscriber to the rstat framework maintains their own stats. The framework is used to tell the subscriber when and what to flush, for the most efficient stats propagation. The workflow is as follows: - When a subscriber updates a cgroup on a cpu, it informs the rstat framework by calling cgroup_rstat_updated(cgrp, cpu). - When a subscriber wants to read some stats for a cgroup, it asks the rstat framework to initiate a stats flush (propagation) by calling cgroup_rstat_flush(cgrp). - When the rstat framework initiates a flush, it makes callbacks to subscribers to aggregate stats on cpus that have updates, and propagate updates to their parent. 
Currently, the main subscribers to the rstat framework are cgroup subsystems (e.g. memory, block). This patch series allow bpf programs to become subscribers as well. Patches in this series are organized as follows: * Patches 1-2 introduce cgroup_iter prog, and a selftest. * Patches 3-5 allow bpf programs to integrate with rstat by adding the necessary hook points and kfunc. A comprehensive selftest that demonstrates the entire workflow for using bpf and rstat to efficiently collect and output cgroup stats is added. --- Changelog: v8 -> v9: - Make UNSPEC (an invalid option) as the default order for cgroup_iter. - Use enum for specifying cgroup_iter order, instead of u32. - Add BPF_ITER_RESHCED to cgroup_iter. - Add cgroup_hierarchical_stats to s390x denylist. v7 -> v8: - Removed the confusing BPF_ITER_DEFAULT (Andrii) - s/SELF/SELF_ONLY/g - Fixed typo (e.g. outputing) (Andrii) - Use "descendants_pre", "descendants_post" etc. instead of "pre", "post" (Andrii) v6 -> v7: - Updated commit/comments in cgroup_iter for read() behavior (Yonghong) - Extracted BPF_ITER_SELF and other options out of cgroup_iter, so that they can be used in other iters. Also renamed them. (Andrii) - Supports both cgroup_fd and cgroup_id when specifying target cgroup. (Andrii) - Avoided using macro for formatting expected output in cgroup_iter selftest. (Andrii) - Applied 'static' on all vars and functions in cgroup_iter selftest. (Andrii) - Fixed broken buf reading in cgroup_iter selftest. (Andrii) - Switched to use bpf_link__destroy() unconditionally. (Andrii) - Removed 'volatile' for non-const global vars in selftests. (Andrii) - Started using bpf_core_enum_value() to get memory_cgrp_id. (Andrii) v5 -> v6: - Rebased on bpf-next - Tidy up cgroup_hierarchical_stats test (Andrii) * 'static' and 'inline' * avoid using libbpf_get_error() * string literals of cgroup paths. - Rename patch 8/8 to 'selftests/bpf' (Yonghong) - Fix cgroup_iter comments (e.g. 
PAGE_SIZE and uapi) (Yonghong) - Make sure further read() returns OK after previous read() finished properly (Yonghong) - Release cgroup_mutex before the last call of show() (Kumar) v4 -> v5: - Rebased on top of new kfunc flags infrastructure, updated patch 1 and patch 6 accordingly. - Added docs for sleepable kfuncs. v3 -> v4: - cgroup_iter: * reorder fields in bpf_link_info to avoid break uapi (Yonghong) * comment the behavior when cgroup_fd=0 (Yonghong) * comment on the limit of number of cgroups supported by cgroup_iter. (Yonghong) - cgroup_hierarchical_stats selftest: * Do not return -1 if stats are not found (causes overflow in userspace). * Check if child process failed to join cgroup. * Make buf and path arrays in get_cgroup_vmscan_delay() static. * Increase the test map sizes to accomodate cgroups that are not created by the test. v2 -> v3: - cgroup_iter: * Added conditional compilation of cgroup_iter.c in kernel/bpf/Makefile (kernel test) and dropped the !CONFIG_CGROUP patch. * Added validation of traversal_order when attaching (Yonghong). * Fixed previous wording "two modes" to "three modes" (Yonghong). * Fixed the btf_dump selftest broken by this patch (Yonghong). * Fixed ctx_arg_info[0] to use "PTR_TO_BTF_ID_OR_NULL" instead of "PTR_TO_BTF_ID", because the "cgroup" pointer passed to iter prog can be null. - Use __diag_push to eliminate __weak noinline warning in bpf_rstat_flush(). - cgroup_hierarchical_stats selftest: * Added write_cgroup_file_parent() helper. * Added error handling for failed map updates. * Added null check for cgroup in vmscan_flush. * Fixed the signature of vmscan_[start/end]. * Correctly return error code when attaching trace programs fail. * Make sure all links are destroyed correctly and not leaking in cgroup_hierarchical_stats selftest. * Use memory.reclaim instead of memory.high as a more reliable way to invoke reclaim. * Eliminated sleeps, the test now runs faster. 
v1 -> v2: - Redesign of cgroup_iter from v1, based on Alexei's idea [1]: * supports walking cgroup subtree. * supports walking ancestors of a cgroup. (Andrii) * supports terminating the walk early. * uses fd instead of cgroup_id as parameter for iter_link. Using fd is a convention in bpf. * gets cgroup's ref at attach time and deref at detach. * brought back cgroup1 support for cgroup_iter. - Squashed the patches adding the rstat flush hook points and kfuncs (Tejun). - Added a comment explaining why bpf_rstat_flush() needs to be weak (Tejun). - Updated the final selftest with the new cgroup_iter design. - Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii). - Removed empty line at the end of the selftest (Yonghong). - Renamed test files to cgroup_hierarchical_stats.c. - Reordered CGROUP_PATH params order to match struct declaration in the selftest (Michal). - Removed memory_subsys_enabled() and made sure memcg controller enablement checks make sense and are documented (Michal). RFC v2 -> v1: - Instead of introducing a new program type for rstat flushing, add an empty hook point, bpf_rstat_flush(), and use fentry bpf programs to attach to it and flush bpf stats. - Instead of using helpers, use kfuncs for rstat functions. - These changes simplify the patchset greatly, with minimal changes to uapi. RFC v1 -> RFC v2: - Instead of rstat flush programs attach to subsystems, they now attach to rstat (global flushers, not per-subsystem), based on discussions with Tejun. The first patch is entirely rewritten. - Pass cgroup pointers to rstat flushers instead of cgroup ids. This is much more flexibility and less likely to need a uapi update later. - rstat helpers are now only defined if CGROUP_CONFIG. - Most of the code is now only defined if CGROUP_CONFIG and CONFIG_BPF_SYSCALL. - Move rstat helper protos from bpf_base_func_proto() to tracing_prog_func_proto(). - rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not ARG_ANYTHING. 
- Rewrote the selftest to use the cgroup helpers. - Dropped bpf_map_lookup_percpu_elem (already added by Feng). - Dropped patch to support cgroup v1 for cgroup_iter. - Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The code that calls it is no longer compiled when !CONFIG_CGROUP. cgroup_iter was originally introduced in a different patch series[2]. Hao and I agreed that it fits better as part of this series. RFC v1 of this patch series had the following changes from [2]: - Getting the cgroup's reference at the time at attaching, instead of at the time when iterating. (Yonghong) - Remove .init_seq_private and .fini_seq_private callbacks for cgroup_iter. They are not needed now. (Yonghong) [1] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/ [2] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/ Hao Luo (2): bpf: Introduce cgroup iter selftests/bpf: Test cgroup_iter. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:38 -07:00
Yosry Ahmed	88886309d2	selftests/bpf: add a selftest for cgroup hierarchical stats collection Add a selftest that tests the whole workflow for collecting, aggregating (flushing), and displaying cgroup hierarchical stats. TL;DR: - Userspace program creates a cgroup hierarchy and induces memcg reclaim in parts of it. - Whenever reclaim happens, vmscan_start and vmscan_end update per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs have updates. - When userspace tries to read the stats, vmscan_dump calls rstat to flush the stats, and outputs the stats in text format to userspace (similar to cgroupfs stats). - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has updates, vmscan_flush aggregates cpu readings and propagates updates to parents. - Userspace program makes sure the stats are aggregated and read correctly. Detailed explanation: - The test loads tracing bpf programs, vmscan_start and vmscan_end, to measure the latency of cgroup reclaim. Per-cgroup readings are stored in percpu maps for efficiency. When a cgroup reading is updated on a cpu, cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the rstat updated tree on that cpu. - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for each cgroup. Reading this file invokes the program, which calls cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all cpus and cgroups that have updates in this cgroup's subtree. Afterwards, the stats are exposed to the user. vmscan_dump returns 1 to terminate iteration early, so that we only expose stats for one cgroup per read. - An ftrace program, vmscan_flush, is also loaded and attached to bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked once for each (cgroup, cpu) pair that has updates. cgroups are popped from the rstat tree in a bottom-up fashion, so calls will always be made for cgroups that have updates before their parents. 
The program aggregates percpu readings to a total per-cgroup reading, and also propagates them to the parent cgroup. After rstat flushing is over, all cgroups will have correct updated hierarchical readings (including all cpus and all their descendants). - Finally, the test creates a cgroup hierarchy and induces memcg reclaim in parts of it, and makes sure that the stats collection, aggregation, and reading workflow works as expected. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-6-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
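The collect, flush and propagate workflow described above can be modelled in a few lines. This is a Python sketch of the bookkeeping only, not BPF code: the hierarchy in `PARENT`, the class `RstatModel` and its method names are all hypothetical, and locking and per-cpu storage details are ignored.

```python
from collections import defaultdict

# Hypothetical hierarchy used only for this sketch: child -> parent.
PARENT = {"root": None, "a": "root", "a1": "a", "a2": "a"}


def depth(cgroup):
    """Number of ancestors between cgroup and the root."""
    d = 0
    while PARENT[cgroup] is not None:
        cgroup = PARENT[cgroup]
        d += 1
    return d


class RstatModel:
    def __init__(self):
        self.percpu = defaultdict(int)   # (cgroup, cpu) -> unflushed delta
        self.pending = defaultdict(int)  # cgroup -> delta pushed up by children
        self.total = defaultdict(int)    # cgroup -> hierarchical total
        self.updated = set()             # (cgroup, cpu) pairs with updates

    def record(self, cgroup, cpu, value):
        """Collector side: store a reading and mark the pair as updated."""
        self.percpu[(cgroup, cpu)] += value
        self.updated.add((cgroup, cpu))

    def flush(self):
        """Flush side: visit updated cgroups bottom-up and propagate."""
        dirty = set()
        for cgroup, _cpu in self.updated:
            # An update also marks every ancestor as needing a flush.
            while cgroup is not None:
                dirty.add(cgroup)
                cgroup = PARENT[cgroup]
        # Children first, so a parent sees its children's deltas.
        for cgroup in sorted(dirty, key=depth, reverse=True):
            delta = self.pending.pop(cgroup, 0)
            for key in [k for k in self.updated if k[0] == cgroup]:
                delta += self.percpu.pop(key)
            self.total[cgroup] += delta
            if PARENT[cgroup] is not None:
                self.pending[PARENT[cgroup]] += delta
        self.updated.clear()
```

In this model `record()` plays the role of the collecting programs plus the "updated" notification, and `flush()` plays the role of the flush callback being invoked for each cgroup with updates, children before parents.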
Yosry Ahmed	434992bb60	selftests/bpf: extend cgroup helpers This patch extends the bpf selftests cgroup_helpers in various ways: - Add enable_controllers() that allows tests to enable all or a subset of controllers for a specific cgroup. - Add join_cgroup_parent(). The cgroup workdir is based on the pid, therefore a spawned child cannot join the same cgroup hierarchy of the test through join_cgroup(). join_cgroup_parent() is used in child processes to join a cgroup under the parent's workdir. - Add write_cgroup_file() and write_cgroup_file_parent() (similar to join_cgroup_parent() above). - Add get_root_cgroup() for tests that need to do checks on root cgroup. - Distinguish relative and absolute cgroup paths in function arguments. Now relative paths are called relative_path, and absolute paths are called cgroup_path. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-5-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Yosry Ahmed	a319185be9	cgroup: bpf: enable bpf programs to integrate with rstat Enable bpf programs to make use of rstat to collect cgroup hierarchical stats efficiently: - Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats. - Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats. - Add an empty bpf_rstat_flush() hook that is called during rstat flushing, for bpf progs that flush stats to attach to. Attaching a bpf prog to this hook effectively registers it as a flush callback. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Hao Luo	fe0dd9d4b7	selftests/bpf: Test cgroup_iter. Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree of the following structure: ROOT (working cgroup) \| PARENT / \ CHILD1 CHILD2 and tests the following scenarios: - invalid cgroup fd. - pre-order walk over descendants from PARENT. - post-order walk over descendants from PARENT. - walk of ancestors from PARENT. - process only a single object (i.e. PARENT). - early termination. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-3-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Hao Luo	d4ccaf58a8	bpf: Introduce cgroup iter Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes: - walking a cgroup's descendants in pre-order. - walking a cgroup's descendants in post-order. - walking a cgroup's ancestors. - processing only the given cgroup. When attaching cgroup_iter, one can set a cgroup to the iter_link created from attaching. This cgroup is passed as a file descriptor or cgroup id and serves as the starting point of the walk. If no cgroup is specified, the starting point will be the root cgroup v2. For walking descendants, one can specify the order: either pre-order or post-order. For walking ancestors, the walk starts at the specified cgroup and ends at the root. One can also terminate the walk early by returning 1 from the iter program. Note that because walking the cgroup hierarchy holds cgroup_mutex, the iter program is called with cgroup_mutex held. Currently only one session is supported, which means, depending on the volume of data the bpf program intends to send to user space, the number of cgroups that can be walked is limited. For example, given the current buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each cgroup, assuming PAGE_SIZE is 4KB, the total number of cgroups that can be walked is 512. This is a limitation of cgroup_iter. If the output data is larger than the kernel buffer size, after all data in the kernel buffer is consumed by user space, the subsequent read() syscall will signal EOPNOTSUPP. To work around this, the user may have to update their program to reduce the volume of data sent to output. For example, skip some uninteresting cgroups. In the future, we may extend bpf_iter flags to allow customizing buffer size. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-25 11:35:37 -07:00
Yang Yingliang	7e165d1939	selftests/bpf: Fix wrong size passed to bpf_setsockopt() sizeof(new_cc) is not the real size of the memory that new_cc points to; introduce new_cc_len to store the size and then pass it to bpf_setsockopt(). Fixes: `31123c0360` ("selftests/bpf: bpf_setsockopt tests") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220824013907.380448-1-yangyingliang@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 18:59:04 -07:00
Daniel Müller	b03914f7ff	selftests/bpf: Add cb_refs test to s390x deny list The cb_refs BPF selftest is failing execution on s390x machines. This is a newly added test that requires a feature not presently supported on this architecture. Denylist the test for this architecture. Fixes: 3cf7e7d8685c ("selftests/bpf: Add tests for reference state fixes for callbacks") Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220824163906.1186832-1-deso@posteo.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 18:59:04 -07:00
Alexei Starovoitov	096830808c	Merge branch 'Fix reference state management for synchronous callbacks' Kumar Kartikeya Dwivedi says: ==================== This is patch 1, 2 + their individual tests split into a separate series from the RFC, so that these can be taken in, while we continue working towards a fix for handling stack access inside the callback. Changelog: ---------- v1 -> v2: v1: https://lore.kernel.org/bpf/20220822131923.21476-1-memxor@gmail.com * Fix error for test_progs-no_alu32 due to distinct alloc_insn in errstr RFC v1 -> v1: RFC v1: https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com * Fix up commit log to add more explanation (Alexei) * Split reference state fix out into a separate series ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 17:54:44 -07:00
Kumar Kartikeya Dwivedi	35f14dbd2f	selftests/bpf: Add tests for reference state fixes for callbacks These are regression tests to ensure we don't end up in invalid runtime state for helpers that execute callbacks multiple times. It exercises the fixes to verifier callback handling for reference state in previous patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013226.24988-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 17:54:40 -07:00
Kumar Kartikeya Dwivedi	9d9d00ac29	bpf: Fix reference state management for synchronous callbacks Currently, verifier verifies callback functions (sync and async) as if they will be executed once, (i.e. it explores execution state as if the function was being called once). The next insn to explore is set to start of subprog and the exit from nested frame is handled using curframe > 0 and prepare_func_exit. In case of async callback it uses a customized variant of push_stack simulating a kind of branch to set up custom state and execution context for the async callback. While this approach is simple and works when callback really will be executed only once, it is unsafe for all of our current helpers which are for_each style, i.e. they execute the callback multiple times. A callback releasing acquired references of the caller may do so multiple times, but currently verifier sees it as one call inside the frame, which then returns to caller. Hence, it thinks it released some reference that the cb e.g. got access through callback_ctx (register filled inside cb from spilled typed register on stack). Similarly, it may see that an acquire call is unpaired inside the callback, so the caller will copy the reference state of callback and then will have to release the register with new ref_obj_ids. But again, the callback may execute multiple times, but the verifier will only account for acquired references for a single symbolic execution of the callback, which will cause leaks. Note that for async callback case, things are different. While currently we have bpf_timer_set_callback which only executes it once, even for multiple executions it would be safe, as reference state is NULL and check_reference_leak would force program to release state before BPF_EXIT. The state is also unaffected by analysis for the caller frame. Hence async callback is safe. Since we want the reference state to be accessible, e.g. 
for pointers loaded from stack through callback_ctx's PTR_TO_STACK, we still have to copy caller's reference_state to callback's bpf_func_state, but we enforce that whatever references it adds to that reference_state has been released before it hits BPF_EXIT. This requires introducing a new callback_ref member in the reference state to distinguish between caller vs callee references. Hence, check_reference_leak now errors out if it sees we are in callback_fn and we have not released callback_ref refs. Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2 etc. we need to also distinguish between whether this particular ref belongs to this callback frame or parent, and only error for our own, so we store state->frameno (which is always non-zero for callbacks). In short, callbacks can read parent reference_state, but cannot mutate it, to be able to use pointers acquired by the caller. They must only undo their changes (by releasing their own acquired_refs before BPF_EXIT) on top of caller reference_state before returning (at which point the caller and callback state will match anyway, so no need to copy it back to caller). Fixes: `69c087ba62` ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-24 17:54:08 -07:00
Kumar Kartikeya Dwivedi	5679ff2f13	bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF They would require func_info which needs prog BTF anyway. Loading BTF and setting the prog btf_fd while loading the prog indirectly requires CAP_BPF, so just to reduce confusion, move both these helpers taking callback under bpf_capable() protection as well, since they cannot be used without CAP_BPF. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220823013117.24916-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:21:59 -07:00
Alexei Starovoitov	f52c894734	Merge branch 'bpf: expose bpf_{g,s}et_retval to more cgroup hooks' Stanislav Fomichev says: ==================== Apparently, only a small subset of cgroup hooks actually falls back to cgroup_base_func_proto. This leads to unexpected result where not all cgroup helpers have bpf_{g,s}et_retval. It's getting harder and harder to manage which helpers are exported to which hooks. We now have the following call chains: - cg_skb_func_proto - sk_filter_func_proto - bpf_sk_base_func_proto - bpf_base_func_proto So by looking at cg_skb_func_proto it's pretty hard to understand what's going on. For cgroup helpers, I'm proposing we do the following instead: func_proto = cgroup_common_func_proto(); if (func_proto) return func_proto; /* optional, if hook has 'current' */ func_proto = cgroup_current_func_proto(); if (func_proto) return func_proto; ... switch (func_id) { /* hook specific helpers */ case BPF_FUNC_hook_specific_helper: return &xyz; default: /* always fall back to plain bpf_base_func_proto */ bpf_base_func_proto(func_id); } If this turns out more workable, we can follow up with converting the rest to the same pattern. v5: - remove net/cls_cgroup.h include from patch 1/5 (Martin) - move endif changes from patch 1/5 to 3/5 (Martin) - don't define __weak protos, the ones in core.c suffice (Martin) v4: - don't touch existing helper.c helpers (Martin) - drop unneeded CONFIG_CGROUP_BPF in bpf_lsm_func_proto (Martin) v3: - expose strtol/strtoul everywhere (Martin) - move helpers declarations from bpf.h to bpf-cgroup.h (Martin) - revise bpf_{g,s}et_retval documentation (Martin) - don't expose bpf_{g,s}et_retval to cg_skb hooks (Martin) v2: - move everything into kernel/bpf/cgroup.c instead (Martin) - use cgroup_common_func_proto in lsm (Martin) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:22 -07:00
Stanislav Fomichev	e7215f5740	selftests/bpf: Make sure bpf_{g,s}et_retval is exposed everywhere For each hook, have a simple bpf_set_retval(bpf_get_retval) program and make sure it loads for the hooks we want. The exceptions are the hooks which don't propagate the error to the callers: - sockops - recvmsg - getpeername - getsockname - cg_skb ingress and egress Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-6-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:22 -07:00
Stanislav Fomichev	2172fb8007	bpf: update bpf_{g,s}et_retval documentation * replace 'syscall' with 'upper layers', still mention that it's being exported via syscall errno * describe what happens in set_retval(-EPERM) + return 1 * describe what happens with bind's 'return 3' Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-5-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:22 -07:00
Stanislav Fomichev	8a67f2de9b	bpf: expose bpf_strtol and bpf_strtoul to all program types bpf_strncmp is already exposed everywhere. The motivation is to keep those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move them under kernel/bpf/cgroup.c because they are currently only used by sysctl prog types. Suggested-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Stanislav Fomichev	bed89185af	bpf: Use cgroup_{common,current}_func_proto in more hooks The following hooks are per-cgroup hooks but they are not using cgroup_{common,current}_func_proto, fix it: * BPF_PROG_TYPE_CGROUP_SKB (cg_skb) * BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr) * BPF_PROG_TYPE_CGROUP_SOCK (cg_sock) * BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP Also: * move common func_proto's into cgroup func_proto handlers * make sure bpf_{g,s}et_retval are not accessible from recvmsg, getpeername and getsockname (return/errno is ignored in these places) * as a side effect, expose get_current_pid_tgid, get_current_comm_proto, get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup hooks Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Stanislav Fomichev	dea6a4e170	bpf: Introduce cgroup_{common,current}_func_proto Split cgroup_base_func_proto into the following: * cgroup_common_func_proto - common helpers for all cgroup hooks * cgroup_current_func_proto - common helpers for all cgroup hooks running in the process context (== have meaningful 'current'). Move bpf_{g,s}et_retval and other cgroup-related helpers into kernel/bpf/cgroup.c so they are closer to where they are being used. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-23 16:08:21 -07:00
Quentin Monnet	92ec1cc378	scripts/bpf: Set date attribute for bpf-helpers(7) man page The bpf-helpers(7) manual page shipped in the man-pages project is generated from the documentation contained in the BPF UAPI header, in the Linux repository, parsed by scripts/bpf_doc.py and then fed to rst2man. The man page should contain the date of last modification of the documentation. This commit adds the relevant date when generating the page. Before: $ ./scripts/bpf_doc.py helpers \| rst2man \| grep '\.TH' .TH BPF-HELPERS 7 "" "Linux v5.19-14022-g30d2a4d74e11" "" After: $ ./scripts/bpf_doc.py helpers \| rst2man \| grep '\.TH' .TH BPF-HELPERS 7 "2022-08-15" "Linux v5.19-14022-g30d2a4d74e11" "" We get the date by using "git log" to look for the commit date of the latest change to the section of the BPF header containing the documentation. If the command fails, we just skip the date field and keep generating the page. Reported-by: Alejandro Colomar <alx.manpages@gmail.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com> Link: https://lore.kernel.org/bpf/20220823155327.98888-2-quentin@isovalent.com	2022-08-23 22:51:04 +02:00
Quentin Monnet	fd0a38f9c3	scripts/bpf: Set version attribute for bpf-helpers(7) man page The bpf-helpers(7) manual page shipped in the man-pages project is generated from the documentation contained in the BPF UAPI header, in the Linux repository, parsed by scripts/bpf_doc.py and then fed to rst2man. After a recent update of that page [0], Alejandro reported that the linter used to validate the man pages complains about the generated document [1]. The header for the page is supposed to contain some attributes that we do not set correctly with the script. This commit updates the "project and version" field. We discussed the format of those fields in [1] and [2]. Before: $ ./scripts/bpf_doc.py helpers \| rst2man \| grep '\.TH' .TH BPF-HELPERS 7 "" "" "" After: $ ./scripts/bpf_doc.py helpers \| rst2man \| grep '\.TH' .TH BPF-HELPERS 7 "" "Linux v5.19-14022-g30d2a4d74e11" "" We get the version from "git describe", but if unavailable, we fall back on "make kernelversion". If none works, for example because neither git nor make are installed, we just set the field to "Linux" and keep generating the page. [0] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/man7/bpf-helpers.7?id=19c7f78393f2b038e76099f87335ddf43a87f039 [1] https://lore.kernel.org/all/20220823084719.13613-1-quentin@isovalent.com/t/#m58a418a318642c6428e14ce9bb84eba5183b06e8 [2] https://lore.kernel.org/all/20220721110821.8240-1-alx.manpages@gmail.com/t/#m8e689a822e03f6e2530a0d6de9d128401916c5de Reported-by: Alejandro Colomar <alx.manpages@gmail.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com> Link: https://lore.kernel.org/bpf/20220823155327.98888-1-quentin@isovalent.com	2022-08-23 22:51:04 +02:00
Shmulik Ladkani	d6513727c2	bpf, selftests: Test BPF_FLOW_DISSECTOR_CONTINUE The dissector program returns BPF_FLOW_DISSECTOR_CONTINUE (and avoids setting skb->flow_keys or last_dissection map) in case it encounters IP packets whose (outer) source address is 127.0.0.127. An additional test is added to prog_tests/flow_dissector.c which sets this address as the test's pkt.iph.saddr, with the expected retval of BPF_FLOW_DISSECTOR_CONTINUE. Also, the legacy test_flow_dissector.sh was similarly augmented. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220821113519.116765-5-shmulik.ladkani@gmail.com	2022-08-23 22:48:12 +02:00
Shmulik Ladkani	5deedfbee8	bpf, test_run: Propagate bpf_flow_dissect's retval to user's bpf_attr.test.retval Formerly, a boolean denoting whether bpf_flow_dissect returned BPF_OK was set into 'bpf_attr.test.retval'. Augment this, so users can check the actual return code of the dissector program under test. Existing prog_tests/flow_dissector*.c tests were correspondingly changed to check against each test's expected retval. Also, tests' resulting 'flow_keys' are verified only in case the expected retval is BPF_OK. This allows adding new tests that expect non BPF_OK. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220821113519.116765-4-shmulik.ladkani@gmail.com	2022-08-23 22:48:03 +02:00
Shmulik Ladkani	91350fe152	bpf, flow_dissector: Introduce BPF_FLOW_DISSECTOR_CONTINUE retcode for bpf progs Currently, attaching BPF_PROG_TYPE_FLOW_DISSECTOR programs completely replaces the flow-dissector logic with custom dissection logic. This forces implementors to write programs that handle dissection for any flows expected in the namespace. It makes sense for flow-dissector BPF programs to just augment the dissector with custom logic (e.g. dissecting certain flows or custom protocols), while enjoying the broad capabilities of the standard dissector for any other traffic. Introduce BPF_FLOW_DISSECTOR_CONTINUE retcode. Flow-dissector BPF programs may return this to indicate no dissection was made, and fallback to the standard dissector is requested. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220821113519.116765-3-shmulik.ladkani@gmail.com	2022-08-23 22:47:55 +02:00
Shmulik Ladkani	0ba985024a	flow_dissector: Make 'bpf_flow_dissect' return the bpf program retcode Let 'bpf_flow_dissect' callers know the BPF program's retcode and act accordingly. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220821113519.116765-2-shmulik.ladkani@gmail.com	2022-08-23 22:47:42 +02:00
Martin KaFai Lau	b979f005d9	selftest/bpf: Add setget_sockopt to DENYLIST.s390x Trampoline is not supported in s390. Fixes: `31123c0360` ("selftests/bpf: bpf_setsockopt tests") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220819192155.91713-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-19 12:46:18 -07:00
Colin Ian King	e918cd231e	selftests/bpf: Fix spelling mistake. There is a spelling mistake in an ASSERT_OK literal string. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Mykola Lysenko <mykolal@fb.com> Link: https://lore.kernel.org/r/20220817213242.101277-1-colin.i.king@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-19 12:45:14 -07:00
Alexei Starovoitov	75179e2b7f	Merge branch 'bpf: net: Remove duplicated code from bpf_setsockopt()' Martin KaFai Lau says: ==================== The code in bpf_setsockopt() is mostly a copy-and-paste from the sock_setsockopt(), do_tcp_setsockopt(), do_ipv6_setsockopt(), and do_ip_setsockopt(). As the set of allowed optnames in bpf_setsockopt() grows, so does the duplicated code. The code between the copies has also slowly drifted. This set is an effort to clean this up and reuse the existing {sock,do_tcp,do_ipv6,do_ip}_setsockopt() as much as possible. After the clean up, this set also adds a few allowed optnames that we need to the bpf_setsockopt(). The initial attempt was to clean up both bpf_setsockopt() and bpf_getsockopt() together. However, the patch set was getting too long. It is beneficial to leave the bpf_getsockopt() out for another patch set. Thus, this set is focusing on the bpf_setsockopt(). v4: - This set now depends on the commit `f574f7f839` ("net: bpf: Use the protocol's set_rcvlowat behavior if there is one") in the net-next tree. The commit calls a specific protocol's set_rcvlowat and it changed the bpf_setsockopt which this set has also changed. Because of this, patch 9 of this set has also been adjusted and a 'sock' NULL check is added to the sk_setsockopt() because some of the bpf hooks have a NULL sk->sk_socket. This removes more dup code from the bpf_setsockopt() side. - Avoid mentioning specific prog types in the comment of the has_current_bpf_ctx(). (Andrii) - Replace signed with unsigned int bitfield in the patch 15 selftest. 
(Daniel) v3: - s/in_bpf/has_current_bpf_ctx/ (Andrii) - Add comment to has_current_bpf_ctx() and sockopt_lock_sock() (Stanislav) - Use vmlinux.h in selftest and add defines to bpf_tracing_net.h (Stanislav) - Use bpf_getsockopt(SO_MARK) in selftest (Stanislav) - Use BPF_CORE_READ_BITFIELD in selftest (Yonghong) v2: - A major change is to use in_bpf() to test if a setsockopt() is called by a bpf prog and use in_bpf() to skip capable check. Suggested by Stanislav. - Instead of passing is_locked through sockptr_t or through an extra argument to sk_setsockopt, v2 uses in_bpf() to skip the lock_sock() also because bpf prog has the lock acquired. - No change to the current sockptr_t in this revision - s/codes/code/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:14 -07:00
Martin KaFai Lau	31123c0360	selftests/bpf: bpf_setsockopt tests This patch adds tests to exercise optnames that are allowed in bpf_setsockopt(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061847.4182339-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:14 -07:00
Martin KaFai Lau	7e41df5dbb	bpf: Add a few optnames to bpf_setsockopt This patch adds a few optnames for bpf_setsockopt: SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY, and TCP_THIN_LINEAR_TIMEOUTS. Thanks to the previous patches of this set, all additions can reuse the sk_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt(). The only change here is to allow them in bpf_setsockopt. The bpf prog has been able to read all members of a sk by using PTR_TO_BTF_ID of a sk. The optname additions here can also be read by the same approach. Meaning there is a way to read the values back. These optnames can also be added to bpf_getsockopt() later with another patch set that makes the bpf_getsockopt() to reuse the sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt(). Thus, this patch does not add more duplicated code to bpf_getsockopt() now. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061841.4181642-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:14 -07:00
Martin KaFai Lau	75b64b68ee	bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt() After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IPV6) and reuses the implementation in do_ipv6_setsockopt(). ipv6 could be compiled as a module. Like how other code solved it with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt to the ipv6_bpf_stub. The current bpf_setsockopt(IPV6_TCLASS) does not take the INET_ECN_MASK into the account for tcp. The do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly. The existing optname white-list is refactored into a new function sol_ipv6_setsockopt(). After this last SOL_IPV6 dup code removal, the __bpf_setsockopt() is simplified enough that the extra "{ }" around the if statement can be removed. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061834.4181198-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:13 -07:00
Martin KaFai Lau	ee7f1e1302	bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt() After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IP) and reuses the implementation in do_ip_setsockopt(). The existing optname white-list is refactored into a new function sol_ip_setsockopt(). NOTE, the current bpf_setsockopt(IP_TOS) is quite different from the do_ip_setsockopt(IP_TOS). For example, it does not take the INET_ECN_MASK into the account for tcp and also does not adjust sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS) was referencing the IPV6_TCLASS implementation instead of IP_TOS. This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS). While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior is arguably what the user is expecting. At least, the INET_ECN_MASK bits should be masked out for tcp. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-08-18 17:06:13 -07:00
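
Note on d4ccaf58a8 ("bpf: Introduce cgroup iter"): the commit message bounds the number of cgroups a single read session can cover by the kernel buffer size divided by the bytes emitted per cgroup. The arithmetic can be sketched as below. This is only an illustration of that estimate, not kernel code; the function name `max_cgroups_walked` and its parameters are hypothetical, and the defaults (8 pages of 4096 bytes) are taken from the commit message's own example.

```python
def max_cgroups_walked(record_bytes, page_size=4096, buffer_pages=8):
    """Estimate how many cgroups fit in one cgroup_iter read session.

    Assumes, as in the commit message, a buffer of buffer_pages * page_size
    bytes and a fixed number of bytes emitted per cgroup.
    """
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    return (buffer_pages * page_size) // record_bytes


if __name__ == "__main__":
    # The commit message's example: 64 bytes per cgroup, 4 KB pages.
    print(max_cgroups_walked(64))
```

Halving the per-cgroup output doubles the estimate, which matches the workaround the commit message suggests: emit less data per cgroup, or skip uninteresting cgroups.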

1121165 Commits