From 32337c0a28242f725c2c499c15100d67a4133050 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sat, 26 Aug 2023 13:08:43 -0700 Subject: [PATCH 01/95] bpf: Prevent inlining of bpf_fentry_test7() With latest clang18, I hit test_progs failures for the following test: #13/2 bpf_cookie/multi_kprobe_link_api:FAIL #13/3 bpf_cookie/multi_kprobe_attach_api:FAIL #13 bpf_cookie:FAIL #75 fentry_fexit:FAIL #76/1 fentry_test/fentry:FAIL #76 fentry_test:FAIL #80/1 fexit_test/fexit:FAIL #80 fexit_test:FAIL #110/1 kprobe_multi_test/skel_api:FAIL #110/2 kprobe_multi_test/link_api_addrs:FAIL #110/3 kprobe_multi_test/link_api_syms:FAIL #110/4 kprobe_multi_test/attach_api_pattern:FAIL #110/5 kprobe_multi_test/attach_api_addrs:FAIL #110/6 kprobe_multi_test/attach_api_syms:FAIL #110 kprobe_multi_test:FAIL For example, for #13/2, the error messages are: [...] kprobe_multi_test_run:FAIL:kprobe_test7_result unexpected kprobe_test7_result: actual 0 != expected 1 [...] kprobe_multi_test_run:FAIL:kretprobe_test7_result unexpected kretprobe_test7_result: actual 0 != expected 1 clang17 does not have this issue. Further investigation shows that kernel func bpf_fentry_test7(), used in the above tests, is inlined by the compiler although it is marked as noinline. int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg) { return (long)arg; } It is known that for simple functions like the above (e.g. just returning a constant or an input argument), the clang compiler may still do inlining for a noinline function. Adding 'asm volatile ("")' in the beginning of the bpf_fentry_test7() can prevent inlining. Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann Tested-by: Eduard Zingerman Link: https://lore.kernel.org/bpf/20230826200843.2210074-1-yonghong.song@linux.dev --- net/bpf/test_run.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 57a7a64b84ed..0841f8d82419 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -543,6 +543,7 @@ struct bpf_fentry_test_t { int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg) { + asm volatile (""); return (long)arg; } From 6a8faf10709161e7138202a8cf052b070971239f Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Wed, 30 Aug 2023 03:03:25 +0000 Subject: [PATCH 02/95] bpftool: Fix build warnings with -Wtype-limits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Quentin reported build warnings when building bpftool : link.c: In function ‘perf_config_hw_cache_str’: link.c:86:18: warning: comparison of unsigned expression in ‘>= 0’ is always true [-Wtype-limits] 86 | if ((id) >= 0 && (id) < ARRAY_SIZE(array)) \ | ^~ link.c:320:20: note: in expansion of macro ‘perf_event_name’ 320 | hw_cache = perf_event_name(evsel__hw_cache, config & 0xff); | ^~~~~~~~~~~~~~~ [... more of the same for the other calls to perf_event_name ...] He also pointed out the reason and the solution: We're always passing unsigned, so it should be safe to drop the check on (id) >= 0. Fixes: 62b57e3ddd64 ("bpftool: Add perf event names") Reported-by: Quentin Monnet Suggested-by: Quentin Monnet Signed-off-by: Yafang Shao Signed-off-by: Daniel Borkmann Acked-by: Quentin Monnet Closes: https://lore.kernel.org/bpf/a35d9a2d-54a0-49ec-9ed1-8fcf1369d3cc@isovalent.com Link: https://lore.kernel.org/bpf/20230830030325.3786-1-laoar.shao@gmail.com --- tools/bpf/bpftool/link.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c index 0b214f6ab5c8..2e5c231e08ac 100644 --- a/tools/bpf/bpftool/link.c +++ b/tools/bpf/bpftool/link.c @@ -83,7 +83,7 @@ const char *evsel__hw_cache_result[PERF_COUNT_HW_CACHE_RESULT_MAX] = { #define perf_event_name(array, id) ({ \ const char *event_str = NULL; \ \ - if ((id) >= 0 && (id) < ARRAY_SIZE(array)) \ + if ((id) < ARRAY_SIZE(array)) \ event_str = array[id]; \ event_str; \ }) From 9d0a67b9d42c630d5013ef81587335d975a7a4a9 Mon Sep 17 00:00:00 2001 From: Tirthendu Sarkar Date: Wed, 23 Aug 2023 20:17:13 +0530 Subject: [PATCH 03/95] xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR() Currently, xsk_build_skb() is a function that builds skb in two possible ways and then is ended with common error handling. We can distinguish four possible error paths and handling in xsk_build_skb(): 1. sock_alloc_send_skb fails: Retry (skb is NULL). 2. skb_store_bits fails : Free skb and retry. 3. MAX_SKB_FRAGS exceeded: Free skb, cleanup and drop packet. 4. alloc_page fails for frag: Retry page allocation w/o freeing skb 1] and 3] can happen in xsk_build_skb_zerocopy(), which is one of the two code paths responsible for building skb. Common error path in xsk_build_skb() assumes that in case errno != -EAGAIN, skb is a valid pointer, which is wrong as kernel test robot reports that in xsk_build_skb_zerocopy() other errno values are returned for skb being NULL. To fix this, set -EOVERFLOW as error when MAX_SKB_FRAGS are exceeded and packet needs to be dropped in both xsk_build_skb() and xsk_build_skb_zerocopy() and use this to distinguish against all other error cases. Also, add explicit kfree_skb() for 3] so that handling of 1], 2], and 3] becomes identical where allocation needs to be retried. Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Tirthendu Sarkar Signed-off-by: Daniel Borkmann Acked-by: Magnus Karlsson Closes: https://lore.kernel.org/r/202307210434.OjgqFcbB-lkp@intel.com Link: https://lore.kernel.org/bpf/20230823144713.2231808-1-tirthendu.sarkar@intel.com --- net/xdp/xsk.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index fcfc8472f73d..55f8b9b0e06d 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -602,7 +602,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, for (copied = 0, i = skb_shinfo(skb)->nr_frags; copied < len; i++) { if (unlikely(i >= MAX_SKB_FRAGS)) - return ERR_PTR(-EFAULT); + return ERR_PTR(-EOVERFLOW); page = pool->umem->pgs[addr >> PAGE_SHIFT]; get_page(page); @@ -655,15 +655,17 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, skb_put(skb, len); err = skb_store_bits(skb, 0, buffer, len); - if (unlikely(err)) + if (unlikely(err)) { + kfree_skb(skb); goto free_err; + } } else { int nr_frags = skb_shinfo(skb)->nr_frags; struct page *page; u8 *vaddr; if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) { - err = -EFAULT; + err = -EOVERFLOW; goto free_err; } @@ -690,12 +692,14 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, return skb; free_err: - if (err == -EAGAIN) { - xsk_cq_cancel_locked(xs, 1); - } else { - xsk_set_destructor_arg(skb); - xsk_drop_skb(skb); + if (err == -EOVERFLOW) { + /* Drop the packet */ + xsk_set_destructor_arg(xs->skb); + xsk_drop_skb(xs->skb); xskq_cons_release(xs->tx); + } else { + /* Let application retry */ + xsk_cq_cancel_locked(xs, 1); } return ERR_PTR(err); @@ -738,7 +742,7 @@ static int __xsk_generic_xmit(struct sock *sk) skb = xsk_build_skb(xs, &desc); if (IS_ERR(skb)) { err = PTR_ERR(skb); - if (err == -EAGAIN) + if (err != -EOVERFLOW) goto out; err = 0; continue; From 5439cfa7fe612e7d02d5a1234feda3fa6e483ba7 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Sun, 27 Aug 2023 08:05:51 -0700 Subject: [PATCH 04/95] selftests/bpf: Fix flaky cgroup_iter_sleepable subtest Occasionally, with './test_progs -j' on my vm, I will hit the following failure: test_cgrp_local_storage:PASS:join_cgroup /cgrp_local_storage 0 nsec test_cgroup_iter_sleepable:PASS:skel_open 0 nsec test_cgroup_iter_sleepable:PASS:skel_load 0 nsec test_cgroup_iter_sleepable:PASS:attach_iter 0 nsec test_cgroup_iter_sleepable:PASS:iter_create 0 nsec test_cgroup_iter_sleepable:FAIL:cgroup_id unexpected cgroup_id: actual 1 != expected 2812 #48/5 cgrp_local_storage/cgroup_iter_sleepable:FAIL #48 cgrp_local_storage:FAIL Finally, I decided to do some investigation since the test is introduced by myself. It turns out the reason is due to cgroup_fd with value 0. In cgroup_iter, a cgroup_fd of value 0 means the root cgroup. /* from cgroup_iter.c */ if (fd) cgrp = cgroup_v1v2_get_from_fd(fd); else if (id) cgrp = cgroup_get_from_id(id); else /* walk the entire hierarchy by default. */ cgrp = cgroup_get_from_path("/"); That is why we got cgroup_id 1 instead of expected 2812. Why we got a cgroup_fd 0? Nobody should really touch 'stdin' (fd 0) in test_progs. I traced 'close' syscall with stack trace and found the root cause, which is a bug in bpf_obj_pinning.c. Basically, the code closed fd 0 although it should not. Fixing the bug in bpf_obj_pinning.c also resolved the above cgroup_iter_sleepable subtest failure. Fixes: 3b22f98e5a05 ("selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests") Signed-off-by: Yonghong Song Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230827150551.1743497-1-yonghong.song@linux.dev --- tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c b/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c index 31f1e815f671..ee0458a5ce78 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_obj_pinning.c @@ -8,6 +8,7 @@ #include #include #include +#include "bpf/libbpf_internal.h" static inline int sys_fsopen(const char *fsname, unsigned flags) { @@ -155,7 +156,7 @@ static void validate_pin(int map_fd, const char *map_name, int src_value, ASSERT_OK(err, "obj_pin"); /* cleanup */ - if (pin_opts.path_fd >= 0) + if (path_kind == PATH_FD_REL && pin_opts.path_fd >= 0) close(pin_opts.path_fd); if (old_cwd[0]) ASSERT_OK(chdir(old_cwd), "restore_cwd"); @@ -220,7 +221,7 @@ static void validate_get(int map_fd, const char *map_name, int src_value, goto cleanup; /* cleanup */ - if (get_opts.path_fd >= 0) + if (path_kind == PATH_FD_REL && get_opts.path_fd >= 0) close(get_opts.path_fd); if (old_cwd[0]) ASSERT_OK(chdir(old_cwd), "restore_cwd"); From 2d71a90f7e0fa3cd348602a36f6eb1237ab7cebb Mon Sep 17 00:00:00 2001 From: Will Hawkins Date: Sat, 26 Aug 2023 01:32:54 -0400 Subject: [PATCH 05/95] bpf, docs: Correct source of offset for program-local call The offset to use when calculating the target of a program-local call is in the instruction's imm field, not its offset field. Signed-off-by: Will Hawkins Signed-off-by: Daniel Borkmann Acked-by: Eduard Zingerman Acked-by: David Vernet Link: https://lore.kernel.org/bpf/20230826053258.1860167-1-hawkinsw@obs.cr --- Documentation/bpf/standardization/instruction-set.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst index 4f73e9dc8d9e..c5b0b2011f16 100644 --- a/Documentation/bpf/standardization/instruction-set.rst +++ b/Documentation/bpf/standardization/instruction-set.rst @@ -373,7 +373,7 @@ BPF_JNE 0x5 any PC += offset if dst != src BPF_JSGT 0x6 any PC += offset if dst > src signed BPF_JSGE 0x7 any PC += offset if dst >= src signed BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_ -BPF_CALL 0x8 0x1 call PC += offset see `Program-local functions`_ +BPF_CALL 0x8 0x1 call PC += imm see `Program-local functions`_ BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_ BPF_EXIT 0x9 0x0 return BPF_JMP only BPF_JLT 0xa any PC += offset if dst < src unsigned @@ -424,8 +424,8 @@ Program-local functions ~~~~~~~~~~~~~~~~~~~~~~~ Program-local functions are functions exposed by the same BPF program as the caller, and are referenced by offset from the call instruction, similar to -``BPF_JA``. A ``BPF_EXIT`` within the program-local function will return to -the caller. +``BPF_JA``. The offset is encoded in the imm field of the call instruction. +A ``BPF_EXIT`` within the program-local function will return to the caller. Load and store instructions =========================== From be4033d36070e44fba766a21ef2d0c24fa04c377 Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Sun, 27 Aug 2023 01:29:12 +0300 Subject: [PATCH 06/95] docs/bpf: Add description for CO-RE relocations Add a section on CO-RE relocations to llvm_relo.rst. Describe relevant .BTF.ext structure, `enum bpf_core_relo_kind` and `struct bpf_core_relo` in some detail. Description is based on doc-strings from: - include/uapi/linux/bpf.h:struct bpf_core_relo - tools/lib/bpf/relo_core.c:__bpf_core_types_match() Signed-off-by: Eduard Zingerman Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20230826222912.2560865-2-eddyz87@gmail.com --- Documentation/bpf/btf.rst | 31 +++- Documentation/bpf/llvm_reloc.rst | 304 +++++++++++++++++++++++++++++++ 2 files changed, 329 insertions(+), 6 deletions(-) diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index f32db1f44ae9..ffc11afee569 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -726,8 +726,8 @@ same as the one describe in :ref:`BTF_Type_String`. 4.2 .BTF.ext section -------------------- -The .BTF.ext section encodes func_info and line_info which needs loader -manipulation before loading into the kernel. +The .BTF.ext section encodes func_info, line_info and CO-RE relocations +which needs loader manipulation before loading into the kernel. The specification for .BTF.ext section is defined at ``tools/lib/bpf/btf.h`` and ``tools/lib/bpf/btf.c``. @@ -745,15 +745,20 @@ The current header of .BTF.ext section:: __u32 func_info_len; __u32 line_info_off; __u32 line_info_len; + + /* optional part of .BTF.ext header */ + __u32 core_relo_off; + __u32 core_relo_len; }; It is very similar to .BTF section. Instead of type/string section, it -contains func_info and line_info section. See :ref:`BPF_Prog_Load` for details -about func_info and line_info record format. +contains func_info, line_info and core_relo sub-sections. +See :ref:`BPF_Prog_Load` for details about func_info and line_info +record format. The func_info is organized as below.:: - func_info_rec_size + func_info_rec_size /* __u32 value */ btf_ext_info_sec for section #1 /* func_info for section #1 */ btf_ext_info_sec for section #2 /* func_info for section #2 */ ... @@ -773,7 +778,7 @@ Here, num_info must be greater than 0. The line_info is organized as below.:: - line_info_rec_size + line_info_rec_size /* __u32 value */ btf_ext_info_sec for section #1 /* line_info for section #1 */ btf_ext_info_sec for section #2 /* line_info for section #2 */ ... @@ -787,6 +792,20 @@ kernel API, the ``insn_off`` is the instruction offset in the unit of ``struct bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the beginning of section (``btf_ext_info_sec->sec_name_off``). +The core_relo is organized as below.:: + + core_relo_rec_size /* __u32 value */ + btf_ext_info_sec for section #1 /* core_relo for section #1 */ + btf_ext_info_sec for section #2 /* core_relo for section #2 */ + +``core_relo_rec_size`` specifies the size of ``bpf_core_relo`` +structure when .BTF.ext is generated. All ``bpf_core_relo`` structures +within a single ``btf_ext_info_sec`` describe relocations applied to +section named by ``btf_ext_info_sec->sec_name_off``. + +See :ref:`Documentation/bpf/llvm_reloc ` +for more information on CO-RE relocations. + 4.2 .BTF_ids section -------------------- diff --git a/Documentation/bpf/llvm_reloc.rst b/Documentation/bpf/llvm_reloc.rst index 450e6403fe3d..73bf805000f2 100644 --- a/Documentation/bpf/llvm_reloc.rst +++ b/Documentation/bpf/llvm_reloc.rst @@ -240,3 +240,307 @@ The .BTF/.BTF.ext sections has R_BPF_64_NODYLD32 relocations:: Offset Info Type Symbol's Value Symbol's Name 000000000000002c 0000000200000004 R_BPF_64_NODYLD32 0000000000000000 .text 0000000000000040 0000000200000004 R_BPF_64_NODYLD32 0000000000000000 .text + +.. _btf-co-re-relocations: + +================= +CO-RE Relocations +================= + +From object file point of view CO-RE mechanism is implemented as a set +of CO-RE specific relocation records. These relocation records are not +related to ELF relocations and are encoded in .BTF.ext section. +See :ref:`Documentation/bpf/btf ` for more +information on .BTF.ext structure. + +CO-RE relocations are applied to BPF instructions to update immediate +or offset fields of the instruction at load time with information +relevant for target kernel. + +Field to patch is selected basing on the instruction class: + +* For BPF_ALU, BPF_ALU64, BPF_LD `immediate` field is patched; +* For BPF_LDX, BPF_STX, BPF_ST `offset` field is patched; +* BPF_JMP, BPF_JMP32 instructions **should not** be patched. + +Relocation kinds +================ + +There are several kinds of CO-RE relocations that could be split in +three groups: + +* Field-based - patch instruction with field related information, e.g. + change offset field of the BPF_LDX instruction to reflect offset + of a specific structure field in the target kernel. + +* Type-based - patch instruction with type related information, e.g. + change immediate field of the BPF_ALU move instruction to 0 or 1 to + reflect if specific type is present in the target kernel. + +* Enum-based - patch instruction with enum related information, e.g. + change immediate field of the BPF_LD_IMM64 instruction to reflect + value of a specific enum literal in the target kernel. + +The complete list of relocation kinds is represented by the following enum: + +.. code-block:: c + + enum bpf_core_relo_kind { + BPF_CORE_FIELD_BYTE_OFFSET = 0, /* field byte offset */ + BPF_CORE_FIELD_BYTE_SIZE = 1, /* field size in bytes */ + BPF_CORE_FIELD_EXISTS = 2, /* field existence in target kernel */ + BPF_CORE_FIELD_SIGNED = 3, /* field signedness (0 - unsigned, 1 - signed) */ + BPF_CORE_FIELD_LSHIFT_U64 = 4, /* bitfield-specific left bitshift */ + BPF_CORE_FIELD_RSHIFT_U64 = 5, /* bitfield-specific right bitshift */ + BPF_CORE_TYPE_ID_LOCAL = 6, /* type ID in local BPF object */ + BPF_CORE_TYPE_ID_TARGET = 7, /* type ID in target kernel */ + BPF_CORE_TYPE_EXISTS = 8, /* type existence in target kernel */ + BPF_CORE_TYPE_SIZE = 9, /* type size in bytes */ + BPF_CORE_ENUMVAL_EXISTS = 10, /* enum value existence in target kernel */ + BPF_CORE_ENUMVAL_VALUE = 11, /* enum value integer value */ + BPF_CORE_TYPE_MATCHES = 12, /* type match in target kernel */ + }; + +Notes: + +* ``BPF_CORE_FIELD_LSHIFT_U64`` and ``BPF_CORE_FIELD_RSHIFT_U64`` are + supposed to be used to read bitfield values using the following + algorithm: + + .. code-block:: c + + // To read bitfield ``f`` from ``struct s`` + is_signed = relo(s->f, BPF_CORE_FIELD_SIGNED) + off = relo(s->f, BPF_CORE_FIELD_BYTE_OFFSET) + sz = relo(s->f, BPF_CORE_FIELD_BYTE_SIZE) + l = relo(s->f, BPF_CORE_FIELD_LSHIFT_U64) + r = relo(s->f, BPF_CORE_FIELD_RSHIFT_U64) + // define ``v`` as signed or unsigned integer of size ``sz`` + v = *({s|u} *)((void *)s + off) + v <<= l + v >>= r + +* The ``BPF_CORE_TYPE_MATCHES`` queries matching relation, defined as + follows: + + * for integers: types match if size and signedness match; + * for arrays & pointers: target types are recursively matched; + * for structs & unions: + + * local members need to exist in target with the same name; + + * for each member we recursively check match unless it is already behind a + pointer, in which case we only check matching names and compatible kind; + + * for enums: + + * local variants have to have a match in target by symbolic name (but not + numeric value); + + * size has to match (but enum may match enum64 and vice versa); + + * for function pointers: + + * number and position of arguments in local type has to match target; + * for each argument and the return value we recursively check match. + +CO-RE Relocation Record +======================= + +Relocation record is encoded as the following structure: + +.. code-block:: c + + struct bpf_core_relo { + __u32 insn_off; + __u32 type_id; + __u32 access_str_off; + enum bpf_core_relo_kind kind; + }; + +* ``insn_off`` - instruction offset (in bytes) within a code section + associated with this relocation; + +* ``type_id`` - BTF type ID of the "root" (containing) entity of a + relocatable type or field; + +* ``access_str_off`` - offset into corresponding .BTF string section. + String interpretation depends on specific relocation kind: + + * for field-based relocations, string encodes an accessed field using + a sequence of field and array indices, separated by colon (:). It's + conceptually very close to LLVM's `getelementptr `_ instruction's + arguments for identifying offset to a field. For example, consider the + following C code: + + .. code-block:: c + + struct sample { + int a; + int b; + struct { int c[10]; }; + } __attribute__((preserve_access_index)); + struct sample *s; + + * Access to ``s[0].a`` would be encoded as ``0:0``: + + * ``0``: first element of ``s`` (as if ``s`` is an array); + * ``0``: index of field ``a`` in ``struct sample``. + + * Access to ``s->a`` would be encoded as ``0:0`` as well. + * Access to ``s->b`` would be encoded as ``0:1``: + + * ``0``: first element of ``s``; + * ``1``: index of field ``b`` in ``struct sample``. + + * Access to ``s[1].c[5]`` would be encoded as ``1:2:0:5``: + + * ``1``: second element of ``s``; + * ``2``: index of anonymous structure field in ``struct sample``; + * ``0``: index of field ``c`` in anonymous structure; + * ``5``: access to array element #5. + + * for type-based relocations, string is expected to be just "0"; + + * for enum value-based relocations, string contains an index of enum + value within its enum type; + +* ``kind`` - one of ``enum bpf_core_relo_kind``. + +.. _GEP: https://llvm.org/docs/LangRef.html#getelementptr-instruction + +.. _btf_co_re_relocation_examples: + +CO-RE Relocation Examples +========================= + +For the following C code: + +.. code-block:: c + + struct foo { + int a; + int b; + unsigned c:15; + } __attribute__((preserve_access_index)); + + enum bar { U, V }; + +With the following BTF definitions: + +.. code-block:: + + ... + [2] STRUCT 'foo' size=8 vlen=2 + 'a' type_id=3 bits_offset=0 + 'b' type_id=3 bits_offset=32 + 'c' type_id=4 bits_offset=64 bitfield_size=15 + [3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED + [4] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none) + ... + [16] ENUM 'bar' encoding=UNSIGNED size=4 vlen=2 + 'U' val=0 + 'V' val=1 + +Field offset relocations are generated automatically when +``__attribute__((preserve_access_index))`` is used, for example: + +.. code-block:: c + + void alpha(struct foo *s, volatile unsigned long *g) { + *g = s->a; + s->a = 1; + } + + 00 : + 0: r3 = *(s32 *)(r1 + 0x0) + 00: CO-RE [2] struct foo::a (0:0) + 1: *(u64 *)(r2 + 0x0) = r3 + 2: *(u32 *)(r1 + 0x0) = 0x1 + 10: CO-RE [2] struct foo::a (0:0) + 3: exit + + +All relocation kinds could be requested via built-in functions. +E.g. field-based relocations: + +.. code-block:: c + + void bravo(struct foo *s, volatile unsigned long *g) { + *g = __builtin_preserve_field_info(s->b, 0 /* field byte offset */); + *g = __builtin_preserve_field_info(s->b, 1 /* field byte size */); + *g = __builtin_preserve_field_info(s->b, 2 /* field existence */); + *g = __builtin_preserve_field_info(s->b, 3 /* field signedness */); + *g = __builtin_preserve_field_info(s->c, 4 /* bitfield left shift */); + *g = __builtin_preserve_field_info(s->c, 5 /* bitfield right shift */); + } + + 20 : + 4: r1 = 0x4 + 20: CO-RE [2] struct foo::b (0:1) + 5: *(u64 *)(r2 + 0x0) = r1 + 6: r1 = 0x4 + 30: CO-RE [2] struct foo::b (0:1) + 7: *(u64 *)(r2 + 0x0) = r1 + 8: r1 = 0x1 + 40: CO-RE [2] struct foo::b (0:1) + 9: *(u64 *)(r2 + 0x0) = r1 + 10: r1 = 0x1 + 50: CO-RE [2] struct foo::b (0:1) + 11: *(u64 *)(r2 + 0x0) = r1 + 12: r1 = 0x31 + 60: CO-RE [2] struct foo::c (0:2) + 13: *(u64 *)(r2 + 0x0) = r1 + 14: r1 = 0x31 + 70: CO-RE [2] struct foo::c (0:2) + 15: *(u64 *)(r2 + 0x0) = r1 + 16: exit + + +Type-based relocations: + +.. code-block:: c + + void charlie(struct foo *s, volatile unsigned long *g) { + *g = __builtin_preserve_type_info(*s, 0 /* type existence */); + *g = __builtin_preserve_type_info(*s, 1 /* type size */); + *g = __builtin_preserve_type_info(*s, 2 /* type matches */); + *g = __builtin_btf_type_id(*s, 0 /* type id in this object file */); + *g = __builtin_btf_type_id(*s, 1 /* type id in target kernel */); + } + + 88 : + 17: r1 = 0x1 + 88: CO-RE [2] struct foo + 18: *(u64 *)(r2 + 0x0) = r1 + 19: r1 = 0xc + 98: CO-RE [2] struct foo + 20: *(u64 *)(r2 + 0x0) = r1 + 21: r1 = 0x1 + a8: CO-RE [2] struct foo + 22: *(u64 *)(r2 + 0x0) = r1 + 23: r1 = 0x2 ll + b8: CO-RE [2] struct foo + 25: *(u64 *)(r2 + 0x0) = r1 + 26: r1 = 0x2 ll + d0: CO-RE [2] struct foo + 28: *(u64 *)(r2 + 0x0) = r1 + 29: exit + +Enum-based relocations: + +.. code-block:: c + + void delta(struct foo *s, volatile unsigned long *g) { + *g = __builtin_preserve_enum_value(*(enum bar *)U, 0 /* enum literal existence */); + *g = __builtin_preserve_enum_value(*(enum bar *)V, 1 /* enum literal value */); + } + + f0 : + 30: r1 = 0x1 ll + f0: CO-RE [16] enum bar::U = 0 + 32: *(u64 *)(r2 + 0x0) = r1 + 33: r1 = 0x1 ll + 108: CO-RE [16] enum bar::V = 1 + 35: *(u64 *)(r2 + 0x0) = r1 + 36: exit From 35d2b7ffffc1d9b3dc6c761010aa3338da49165b Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Tue, 29 Aug 2023 22:35:17 -0700 Subject: [PATCH 07/95] bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t Sockmap and sockhash maps are a collection of psocks that are objects representing a socket plus a set of metadata needed to manage the BPF programs associated with the socket. These maps use the stab->lock to protect from concurrent operations on the maps, e.g. trying to insert to objects into the array at the same time in the same slot. Additionally, a sockhash map has a bucket lock to protect iteration and insert/delete into the hash entry. Each psock has a psock->link which is a linked list of all the maps that a psock is attached to. This allows a psock (socket) to be included in multiple sockmap and sockhash maps. This linked list is protected the psock->link_lock. They _must_ be nested correctly to avoid deadlock: lock(stab->lock) : do BPF map operations and psock insert/delete lock(psock->link_lock) : add map to psock linked list of maps unlock(psock->link_lock) unlock(stab->lock) For non PREEMPT_RT kernels both raw_spin_lock_t and spin_lock_t are guaranteed to not sleep. But, with PREEMPT_RT kernels the spin_lock_t variants may sleep. In the current code we have many patterns like this: rcu_critical_section: raw_spin_lock(stab->lock) spin_lock(psock->link_lock) <- may sleep ouch spin_unlock(psock->link_lock) raw_spin_unlock(stab->lock) rcu_critical_section Nesting spin_lock() inside a raw_spin_lock() violates locking rules for PREEMPT_RT kernels. And additionally we do alloc(GFP_ATOMICS) inside the stab->lock, but those might sleep on PREEMPT_RT kernels. The result is splats like this: ./test_progs -t sockmap_basic [ 33.344330] bpf_testmod: loading out-of-tree module taints kernel. [ 33.441933] [ 33.442089] ============================= [ 33.442421] [ BUG: Invalid wait context ] [ 33.442763] 6.5.0-rc5-01731-gec0ded2e0282 #4958 Tainted: G O [ 33.443320] ----------------------------- [ 33.443624] test_progs/2073 is trying to lock: [ 33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0 [ 33.444636] other info that might help us debug this: [ 33.444991] context-{5:5} [ 33.445183] 3 locks held by test_progs/2073: [ 33.445498] #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330 [ 33.446159] #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330 [ 33.446809] #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0 [ 33.447445] stack backtrace: [ 33.447655] CPU: 10 PID To fix observe we can't readily remove the allocations (for that we would need to use/create something similar to bpf_map_alloc). So convert raw_spin_lock_t to spin_lock_t. We note that sock_map_update that would trigger the allocate and potential sleep is only allowed through sys_bpf ops and via sock_ops which precludes hw interrupts and low level atomic sections in RT preempt kernel. On non RT preempt kernel there are no changes here and spin locks sections and alloc(GFP_ATOMIC) are still not sleepable. Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230830053517.166611-1-john.fastabend@gmail.com --- net/core/sock_map.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 8f07fea39d9e..cb11750b1df5 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -18,7 +18,7 @@ struct bpf_stab { struct bpf_map map; struct sock **sks; struct sk_psock_progs progs; - raw_spinlock_t lock; + spinlock_t lock; }; #define SOCK_CREATE_FLAG_MASK \ @@ -44,7 +44,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr) return ERR_PTR(-ENOMEM); bpf_map_init_from_attr(&stab->map, attr); - raw_spin_lock_init(&stab->lock); + spin_lock_init(&stab->lock); stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries * sizeof(struct sock *), @@ -411,7 +411,7 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test, struct sock *sk; int err = 0; - raw_spin_lock_bh(&stab->lock); + spin_lock_bh(&stab->lock); sk = *psk; if (!sk_test || sk_test == sk) sk = xchg(psk, NULL); @@ -421,7 +421,7 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test, else err = -EINVAL; - raw_spin_unlock_bh(&stab->lock); + spin_unlock_bh(&stab->lock); return err; } @@ -487,7 +487,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx, psock = sk_psock(sk); WARN_ON_ONCE(!psock); - raw_spin_lock_bh(&stab->lock); + spin_lock_bh(&stab->lock); osk = stab->sks[idx]; if (osk && flags == BPF_NOEXIST) { ret = -EEXIST; @@ -501,10 +501,10 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx, stab->sks[idx] = sk; if (osk) sock_map_unref(osk, &stab->sks[idx]); - raw_spin_unlock_bh(&stab->lock); + spin_unlock_bh(&stab->lock); return 0; out_unlock: - raw_spin_unlock_bh(&stab->lock); + spin_unlock_bh(&stab->lock); if (psock) sk_psock_put(sk, psock); out_free: @@ -835,7 +835,7 @@ struct bpf_shtab_elem { struct bpf_shtab_bucket { struct hlist_head head; - raw_spinlock_t lock; + spinlock_t lock; }; struct bpf_shtab { @@ -910,7 +910,7 @@ static void sock_hash_delete_from_link(struct bpf_map *map, struct sock *sk, * is okay since it's going away only after RCU grace period. * However, we need to check whether it's still present. */ - raw_spin_lock_bh(&bucket->lock); + spin_lock_bh(&bucket->lock); elem_probe = sock_hash_lookup_elem_raw(&bucket->head, elem->hash, elem->key, map->key_size); if (elem_probe && elem_probe == elem) { @@ -918,7 +918,7 @@ static void sock_hash_delete_from_link(struct bpf_map *map, struct sock *sk, sock_map_unref(elem->sk, elem); sock_hash_free_elem(htab, elem); } - raw_spin_unlock_bh(&bucket->lock); + spin_unlock_bh(&bucket->lock); } static long sock_hash_delete_elem(struct bpf_map *map, void *key) @@ -932,7 +932,7 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key) hash = sock_hash_bucket_hash(key, key_size); bucket = sock_hash_select_bucket(htab, hash); - raw_spin_lock_bh(&bucket->lock); + spin_lock_bh(&bucket->lock); elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size); if (elem) { hlist_del_rcu(&elem->node); @@ -940,7 +940,7 @@ static long sock_hash_delete_elem(struct bpf_map *map, void *key) sock_hash_free_elem(htab, elem); ret = 0; } - raw_spin_unlock_bh(&bucket->lock); + spin_unlock_bh(&bucket->lock); return ret; } @@ -1000,7 +1000,7 @@ static int sock_hash_update_common(struct bpf_map *map, void *key, hash = sock_hash_bucket_hash(key, key_size); bucket = sock_hash_select_bucket(htab, hash); - raw_spin_lock_bh(&bucket->lock); + spin_lock_bh(&bucket->lock); elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size); if (elem && flags == BPF_NOEXIST) { ret = -EEXIST; @@ -1026,10 +1026,10 @@ static int sock_hash_update_common(struct bpf_map *map, void *key, sock_map_unref(elem->sk, elem); sock_hash_free_elem(htab, elem); } - raw_spin_unlock_bh(&bucket->lock); + spin_unlock_bh(&bucket->lock); return 0; out_unlock: - raw_spin_unlock_bh(&bucket->lock); + spin_unlock_bh(&bucket->lock); sk_psock_put(sk, psock); out_free: sk_psock_free_link(link); @@ -1115,7 +1115,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) for (i = 0; i < htab->buckets_num; i++) { INIT_HLIST_HEAD(&htab->buckets[i].head); - raw_spin_lock_init(&htab->buckets[i].lock); + spin_lock_init(&htab->buckets[i].lock); } return &htab->map; @@ -1147,11 +1147,11 @@ static void sock_hash_free(struct bpf_map *map) * exists, psock exists and holds a ref to socket. That * lets us to grab a socket ref too. */ - raw_spin_lock_bh(&bucket->lock); + spin_lock_bh(&bucket->lock); hlist_for_each_entry(elem, &bucket->head, node) sock_hold(elem->sk); hlist_move_list(&bucket->head, &unlink_list); - raw_spin_unlock_bh(&bucket->lock); + spin_unlock_bh(&bucket->lock); /* Process removed entries out of atomic context to * block for socket lock before deleting the psock's From e4da8c78973c1e307c0431e0b99a969ffb8aa3f1 Mon Sep 17 00:00:00 2001 From: Heng Guo Date: Fri, 25 Aug 2023 15:55:05 +0800 Subject: [PATCH 08/95] net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicated commit edf391ff1723 ("snmp: add missing counters for RFC 4293") had already added OutOctets for RFC 4293. In commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams"), OutOctets was counted again, but not removed from ip_output(). According to RFC 4293 "3.2.3. IP Statistics Tables", ipipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So "IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not accurate. And IPSTATS_MIB_OUTOCTETS should be counted after fragment. This patch reverts commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams") and move IPSTATS_MIB_OUTOCTETS to ip_finish_output2 for ipv4. Reviewed-by: Filip Pudak Signed-off-by: Heng Guo Signed-off-by: David S. Miller --- net/ipv4/ip_forward.c | 1 - net/ipv4/ip_output.c | 7 +++---- net/ipv4/ipmr.c | 1 - net/ipv6/ip6_output.c | 1 - net/ipv6/ip6mr.c | 2 -- 5 files changed, 3 insertions(+), 9 deletions(-) diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index e18931a6d153..66fac1216d46 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -67,7 +67,6 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s struct ip_options *opt = &(IPCB(skb)->opt); __IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS); - __IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len); #ifdef CONFIG_NET_SWITCHDEV if (skb->offload_l3_fwd_mark) { diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 43ba4b77b248..b2e0ad312028 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -207,6 +207,9 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s } else if (rt->rt_type == RTN_BROADCAST) IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len); + /* OUTOCTETS should be counted after fragment */ + IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len); + if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); if (!skb) @@ -366,8 +369,6 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb) /* * If the indicated interface is up and running, send the packet. */ - IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len); - skb->dev = dev; skb->protocol = htons(ETH_P_IP); @@ -424,8 +425,6 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb) { struct net_device *dev = skb_dst(skb)->dev, *indev = skb->dev; - IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len); - skb->dev = dev; skb->protocol = htons(ETH_P_IP); diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 3f0c6d602fb7..9e222a57bc2b 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1804,7 +1804,6 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk, struct ip_options *opt = &(IPCB(skb)->opt); IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS); - IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len); if (unlikely(opt->optlen)) ip_forward_options(skb); diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 0665e8b09968..4ab50169a5a9 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -451,7 +451,6 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk, struct dst_entry *dst = skb_dst(skb); __IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTFORWDATAGRAMS); - __IP6_ADD_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTOCTETS, skb->len); #ifdef CONFIG_NET_SWITCHDEV if (skb->offload_l3_fwd_mark) { diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c index 67a3b8f6e72b..30ca064b76ef 100644 --- a/net/ipv6/ip6mr.c +++ b/net/ipv6/ip6mr.c @@ -2010,8 +2010,6 @@ static inline int ip6mr_forward2_finish(struct net *net, struct sock *sk, struct { IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_OUTFORWDATAGRAMS); - IP6_ADD_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_OUTOCTETS, skb->len); return dst_output(net, sk, skb); } From aee1720eeb87a3adc242eb07e5d4f7ba3eb8c736 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 28 Aug 2023 10:59:46 -0500 Subject: [PATCH 09/95] bpf, docs: Move linux-notes.rst to root bpf docs tree In commit 4d496be9ca05 ("bpf,docs: Create new standardization subdirectory"), I added a standardization/ directory to the BPF documentation, which will contain the docs that will be standardized as part of the effort with the IETF. I included linux-notes.rst in that directory, but I shouldn't have. It doesn't contain anything that will be standardized. Let's move it back to Documentation/bpf. Signed-off-by: David Vernet Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230828155948.123405-2-void@manifault.com --- Documentation/bpf/index.rst | 1 + Documentation/bpf/{standardization => }/linux-notes.rst | 0 Documentation/bpf/standardization/index.rst | 1 - 3 files changed, 1 insertion(+), 1 deletion(-) rename Documentation/bpf/{standardization => }/linux-notes.rst (100%) diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 1ff177b89d66..aeaeb35e6d4a 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -29,6 +29,7 @@ that goes into great technical depth about the BPF Architecture. bpf_licensing test_debug clang-notes + linux-notes other redirect diff --git a/Documentation/bpf/standardization/linux-notes.rst b/Documentation/bpf/linux-notes.rst similarity index 100% rename from Documentation/bpf/standardization/linux-notes.rst rename to Documentation/bpf/linux-notes.rst diff --git a/Documentation/bpf/standardization/index.rst b/Documentation/bpf/standardization/index.rst index 09c6ba055fd7..d7b946f71261 100644 --- a/Documentation/bpf/standardization/index.rst +++ b/Documentation/bpf/standardization/index.rst @@ -12,7 +12,6 @@ for the working group charter, documents, and more. :maxdepth: 1 instruction-set - linux-notes .. Links: .. _IETF BPF Working Group: https://datatracker.ietf.org/wg/bpf/about/ From deb88407254621bf926658cff49a7ba01b59dec6 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 28 Aug 2023 10:59:47 -0500 Subject: [PATCH 10/95] bpf, docs: Add abi.rst document to standardization subdirectory As specified in the IETF BPF charter, the BPF working group has plans to add one or more informational documents that recommend conventions and guidelines for producing portable BPF program binaries. The instruction-set.rst document currently contains a "Registers and calling convention" subsection which dictates a calling convention that belongs in an ABI document, rather than an instruction set document. Let's move it to a new abi.rst document so we can clean it up. The abi.rst document will of course be significantly changed and expanded upon over time. For now, it's really just a placeholder which will contain ABI-specific language that doesn't belong in other documents. Signed-off-by: David Vernet Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230828155948.123405-3-void@manifault.com --- Documentation/bpf/standardization/abi.rst | 25 +++++++++++++++++++ Documentation/bpf/standardization/index.rst | 1 + .../bpf/standardization/instruction-set.rst | 16 ------------ 3 files changed, 26 insertions(+), 16 deletions(-) create mode 100644 Documentation/bpf/standardization/abi.rst diff --git a/Documentation/bpf/standardization/abi.rst b/Documentation/bpf/standardization/abi.rst new file mode 100644 index 000000000000..0c2e10eeb89a --- /dev/null +++ b/Documentation/bpf/standardization/abi.rst @@ -0,0 +1,25 @@ +.. contents:: +.. sectnum:: + +=================================================== +BPF ABI Recommended Conventions and Guidelines v1.0 +=================================================== + +This is version 1.0 of an informational document containing recommended +conventions and guidelines for producing portable BPF program binaries. + +Registers and calling convention +================================ + +BPF has 10 general purpose registers and a read-only frame pointer register, +all of which are 64-bits wide. + +The BPF calling convention is defined as: + +* R0: return value from function calls, and exit value for BPF programs +* R1 - R5: arguments for function calls +* R6 - R9: callee saved registers that function calls will preserve +* R10: read-only frame pointer to access stack + +R0 - R5 are scratch registers and BPF programs needs to spill/fill them if +necessary across calls. diff --git a/Documentation/bpf/standardization/index.rst b/Documentation/bpf/standardization/index.rst index d7b946f71261..a50c3baf6345 100644 --- a/Documentation/bpf/standardization/index.rst +++ b/Documentation/bpf/standardization/index.rst @@ -12,6 +12,7 @@ for the working group charter, documents, and more. :maxdepth: 1 instruction-set + abi .. Links: .. _IETF BPF Working Group: https://datatracker.ietf.org/wg/bpf/about/ diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst index c5b0b2011f16..83583d735e38 100644 --- a/Documentation/bpf/standardization/instruction-set.rst +++ b/Documentation/bpf/standardization/instruction-set.rst @@ -97,22 +97,6 @@ Definitions A: 10000110 B: 11111111 10000110 -Registers and calling convention -================================ - -eBPF has 10 general purpose registers and a read-only frame pointer register, -all of which are 64-bits wide. - -The eBPF calling convention is defined as: - -* R0: return value from function calls, and exit value for eBPF programs -* R1 - R5: arguments for function calls -* R6 - R9: callee saved registers that function calls will preserve -* R10: read-only frame pointer to access stack - -R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if -necessary across calls. - Instruction encoding ==================== From 7d35eb1a184a3f0759ad9e9cde4669b5c55b2063 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 28 Aug 2023 10:59:48 -0500 Subject: [PATCH 11/95] bpf, docs: s/eBPF/BPF in standards documents There isn't really anything other than just "BPF" at this point, so referring to it as "eBPF" in our standards document just causes unnecessary confusion. Let's just be consistent and use "BPF". Suggested-by: Will Hawkins Signed-off-by: David Vernet Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230828155948.123405-4-void@manifault.com --- .../bpf/standardization/instruction-set.rst | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst index 83583d735e38..c5d53a6e8c79 100644 --- a/Documentation/bpf/standardization/instruction-set.rst +++ b/Documentation/bpf/standardization/instruction-set.rst @@ -1,11 +1,11 @@ .. contents:: .. sectnum:: -======================================== -eBPF Instruction Set Specification, v1.0 -======================================== +======================================= +BPF Instruction Set Specification, v1.0 +======================================= -This document specifies version 1.0 of the eBPF instruction set. +This document specifies version 1.0 of the BPF instruction set. Documentation conventions ========================= @@ -100,7 +100,7 @@ Definitions Instruction encoding ==================== -eBPF has two instruction encodings: +BPF has two instruction encodings: * the basic instruction encoding, which uses 64 bits to encode an instruction * the wide instruction encoding, which appends a second 64-bit immediate (i.e., @@ -244,7 +244,7 @@ BPF_END 0xd0 0 byte swap operations (see `Byte swap instructions`_ b ========= ===== ======= ========================================================== Underflow and overflow are allowed during arithmetic operations, meaning -the 64-bit or 32-bit value will wrap. If eBPF program execution would +the 64-bit or 32-bit value will wrap. If BPF program execution would result in division by zero, the destination register is instead set to zero. If execution would result in modulo by zero, for ``BPF_ALU64`` the value of the destination register is unchanged whereas for ``BPF_ALU`` the upper @@ -366,7 +366,7 @@ BPF_JSLT 0xc any PC += offset if dst < src signed BPF_JSLE 0xd any PC += offset if dst <= src signed ======== ===== === =========================================== ========================================= -The eBPF program needs to store the return value into register R0 before doing a +The BPF program needs to store the return value into register R0 before doing a ``BPF_EXIT``. Example: @@ -486,9 +486,9 @@ Atomic operations Atomic operations are operations that operate on memory and can not be interrupted or corrupted by other access to the same memory region -by other eBPF programs or means outside of this specification. +by other BPF programs or means outside of this specification. -All atomic operations supported by eBPF are encoded as store operations +All atomic operations supported by BPF are encoded as store operations that use the ``BPF_ATOMIC`` mode modifier as follows: * ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations @@ -578,7 +578,7 @@ where Maps ~~~~ -Maps are shared memory regions accessible by eBPF programs on some platforms. +Maps are shared memory regions accessible by BPF programs on some platforms. A map can have various semantics as defined in a separate document, and may or may not have a single contiguous memory region, but the 'map_val(map)' is currently only defined for maps that do have a single contiguous memory region. @@ -600,6 +600,6 @@ identified by the given id. Legacy BPF Packet access instructions ------------------------------------- -eBPF previously introduced special instructions for access to packet data that were +BPF previously introduced special instructions for access to packet data that were carried over from classic BPF. However, these instructions are deprecated and should no longer be used. From 28427f368f0e08d504ed06e74bc7cc79d6d06511 Mon Sep 17 00:00:00 2001 From: Xiao Liang Date: Fri, 25 Aug 2023 13:33:27 +0800 Subject: [PATCH 12/95] netfilter: nft_exthdr: Fix non-linear header modification Fix skb_ensure_writable() size. Don't use nft_tcp_header_pointer() to make it explicit that pointers point to the packet (not local buffer). Fixes: 99d1712bc41c ("netfilter: exthdr: tcp option set support") Fixes: 7890cbea66e7 ("netfilter: exthdr: add support for tcp option removal") Cc: stable@vger.kernel.org Signed-off-by: Xiao Liang Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_exthdr.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c index 7f856ceb3a66..a9844eefedeb 100644 --- a/net/netfilter/nft_exthdr.c +++ b/net/netfilter/nft_exthdr.c @@ -238,7 +238,12 @@ static void nft_exthdr_tcp_set_eval(const struct nft_expr *expr, if (!tcph) goto err; + if (skb_ensure_writable(pkt->skb, nft_thoff(pkt) + tcphdr_len)) + goto err; + + tcph = (struct tcphdr *)(pkt->skb->data + nft_thoff(pkt)); opt = (u8 *)tcph; + for (i = sizeof(*tcph); i < tcphdr_len - 1; i += optl) { union { __be16 v16; @@ -253,15 +258,6 @@ static void nft_exthdr_tcp_set_eval(const struct nft_expr *expr, if (i + optl > tcphdr_len || priv->len + priv->offset > optl) goto err; - if (skb_ensure_writable(pkt->skb, - nft_thoff(pkt) + i + priv->len)) - goto err; - - tcph = nft_tcp_header_pointer(pkt, sizeof(buff), buff, - &tcphdr_len); - if (!tcph) - goto err; - offset = i + priv->offset; switch (priv->len) { @@ -325,9 +321,9 @@ static void nft_exthdr_tcp_strip_eval(const struct nft_expr *expr, if (skb_ensure_writable(pkt->skb, nft_thoff(pkt) + tcphdr_len)) goto drop; - opt = (u8 *)nft_tcp_header_pointer(pkt, sizeof(buff), buff, &tcphdr_len); - if (!opt) - goto err; + tcph = (struct tcphdr *)(pkt->skb->data + nft_thoff(pkt)); + opt = (u8 *)tcph; + for (i = sizeof(*tcph); i < tcphdr_len - 1; i += optl) { unsigned int j; From e99476497687ef9e850748fe6d232264f30bc8f9 Mon Sep 17 00:00:00 2001 From: Wander Lairson Costa Date: Mon, 28 Aug 2023 19:12:55 -0300 Subject: [PATCH 13/95] netfilter: xt_sctp: validate the flag_info count sctp_mt_check doesn't validate the flag_count field. An attacker can take advantage of that to trigger a OOB read and leak memory information. Add the field validation in the checkentry function. Fixes: 2e4e6a17af35 ("[NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables") Cc: stable@vger.kernel.org Reported-by: Lucas Leong Signed-off-by: Wander Lairson Costa Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_sctp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/netfilter/xt_sctp.c b/net/netfilter/xt_sctp.c index e8961094a282..b46a6a512058 100644 --- a/net/netfilter/xt_sctp.c +++ b/net/netfilter/xt_sctp.c @@ -149,6 +149,8 @@ static int sctp_mt_check(const struct xt_mtchk_param *par) { const struct xt_sctp_info *info = par->matchinfo; + if (info->flag_count > ARRAY_SIZE(info->flag_info)) + return -EINVAL; if (info->flags & ~XT_SCTP_VALID_FLAGS) return -EINVAL; if (info->invflags & ~XT_SCTP_VALID_FLAGS) From 69c5d284f67089b4750d28ff6ac6f52ec224b330 Mon Sep 17 00:00:00 2001 From: Wander Lairson Costa Date: Mon, 28 Aug 2023 10:21:07 -0300 Subject: [PATCH 14/95] netfilter: xt_u32: validate user space input The xt_u32 module doesn't validate the fields in the xt_u32 structure. An attacker may take advantage of this to trigger an OOB read by setting the size fields with a value beyond the arrays boundaries. Add a checkentry function to validate the structure. This was originally reported by the ZDI project (ZDI-CAN-18408). Fixes: 1b50b8a371e9 ("[NETFILTER]: Add u32 match") Cc: stable@vger.kernel.org Signed-off-by: Wander Lairson Costa Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_u32.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/net/netfilter/xt_u32.c b/net/netfilter/xt_u32.c index 177b40d08098..117d4615d668 100644 --- a/net/netfilter/xt_u32.c +++ b/net/netfilter/xt_u32.c @@ -96,11 +96,32 @@ static bool u32_mt(const struct sk_buff *skb, struct xt_action_param *par) return ret ^ data->invert; } +static int u32_mt_checkentry(const struct xt_mtchk_param *par) +{ + const struct xt_u32 *data = par->matchinfo; + const struct xt_u32_test *ct; + unsigned int i; + + if (data->ntests > ARRAY_SIZE(data->tests)) + return -EINVAL; + + for (i = 0; i < data->ntests; ++i) { + ct = &data->tests[i]; + + if (ct->nnums > ARRAY_SIZE(ct->location) || + ct->nvalues > ARRAY_SIZE(ct->value)) + return -EINVAL; + } + + return 0; +} + static struct xt_match xt_u32_mt_reg __read_mostly = { .name = "u32", .revision = 0, .family = NFPROTO_UNSPEC, .match = u32_mt, + .checkentry = u32_mt_checkentry, .matchsize = sizeof(struct xt_u32), .me = THIS_MODULE, }; From 7e9be1124dbe7888907e82cab20164578e3f9ab7 Mon Sep 17 00:00:00 2001 From: Phil Sutter Date: Tue, 29 Aug 2023 19:51:57 +0200 Subject: [PATCH 15/95] netfilter: nf_tables: Audit log setelem reset Since set element reset is not integrated into nf_tables' transaction logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET handling. For the sake of simplicity, catchall element reset will always generate a dedicated log entry. This relieves nf_tables_dump_set() from having to adjust the logged element count depending on whether a catchall element was found or not. Fixes: 079cd633219d7 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET") Signed-off-by: Phil Sutter Reviewed-by: Richard Guy Briggs Signed-off-by: Pablo Neira Ayuso --- include/linux/audit.h | 1 + kernel/auditsc.c | 1 + net/netfilter/nf_tables_api.c | 31 ++++++++++++++++++++++++++++--- 3 files changed, 30 insertions(+), 3 deletions(-) diff --git a/include/linux/audit.h b/include/linux/audit.h index 6a3a9e122bb5..192bf03aacc5 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -117,6 +117,7 @@ enum audit_nfcfgop { AUDIT_NFT_OP_OBJ_RESET, AUDIT_NFT_OP_FLOWTABLE_REGISTER, AUDIT_NFT_OP_FLOWTABLE_UNREGISTER, + AUDIT_NFT_OP_SETELEM_RESET, AUDIT_NFT_OP_INVALID, }; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index addeed3df15d..38481e318197 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -143,6 +143,7 @@ static const struct audit_nfcfgop_tab audit_nfcfgs[] = { { AUDIT_NFT_OP_OBJ_RESET, "nft_reset_obj" }, { AUDIT_NFT_OP_FLOWTABLE_REGISTER, "nft_register_flowtable" }, { AUDIT_NFT_OP_FLOWTABLE_UNREGISTER, "nft_unregister_flowtable" }, + { AUDIT_NFT_OP_SETELEM_RESET, "nft_reset_setelem" }, { AUDIT_NFT_OP_INVALID, "nft_invalid" }, }; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 41b826dff6f5..361e98e71692 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -102,6 +102,7 @@ static const u8 nft2audit_op[NFT_MSG_MAX] = { // enum nf_tables_msg_types [NFT_MSG_NEWFLOWTABLE] = AUDIT_NFT_OP_FLOWTABLE_REGISTER, [NFT_MSG_GETFLOWTABLE] = AUDIT_NFT_OP_INVALID, [NFT_MSG_DELFLOWTABLE] = AUDIT_NFT_OP_FLOWTABLE_UNREGISTER, + [NFT_MSG_GETSETELEM_RESET] = AUDIT_NFT_OP_SETELEM_RESET, }; static void nft_validate_state_update(struct nft_table *table, u8 new_validate_state) @@ -5624,13 +5625,25 @@ static int nf_tables_dump_setelem(const struct nft_ctx *ctx, return nf_tables_fill_setelem(args->skb, set, elem, args->reset); } +static void audit_log_nft_set_reset(const struct nft_table *table, + unsigned int base_seq, + unsigned int nentries) +{ + char *buf = kasprintf(GFP_ATOMIC, "%s:%u", table->name, base_seq); + + audit_log_nfcfg(buf, table->family, nentries, + AUDIT_NFT_OP_SETELEM_RESET, GFP_ATOMIC); + kfree(buf); +} + struct nft_set_dump_ctx { const struct nft_set *set; struct nft_ctx ctx; }; static int nft_set_catchall_dump(struct net *net, struct sk_buff *skb, - const struct nft_set *set, bool reset) + const struct nft_set *set, bool reset, + unsigned int base_seq) { struct nft_set_elem_catchall *catchall; u8 genmask = nft_genmask_cur(net); @@ -5646,6 +5659,8 @@ static int nft_set_catchall_dump(struct net *net, struct sk_buff *skb, elem.priv = catchall->elem; ret = nf_tables_fill_setelem(skb, set, &elem, reset); + if (reset && !ret) + audit_log_nft_set_reset(set->table, base_seq, 1); break; } @@ -5725,12 +5740,17 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb) set->ops->walk(&dump_ctx->ctx, set, &args.iter); if (!args.iter.err && args.iter.count == cb->args[0]) - args.iter.err = nft_set_catchall_dump(net, skb, set, reset); + args.iter.err = nft_set_catchall_dump(net, skb, set, + reset, cb->seq); rcu_read_unlock(); nla_nest_end(skb, nest); nlmsg_end(skb, nlh); + if (reset && args.iter.count > args.iter.skip) + audit_log_nft_set_reset(table, cb->seq, + args.iter.count - args.iter.skip); + if (args.iter.err && args.iter.err != -EMSGSIZE) return args.iter.err; if (args.iter.count == cb->args[0]) @@ -5955,13 +5975,13 @@ static int nf_tables_getsetelem(struct sk_buff *skb, struct netlink_ext_ack *extack = info->extack; u8 genmask = nft_genmask_cur(info->net); u8 family = info->nfmsg->nfgen_family; + int rem, err = 0, nelems = 0; struct net *net = info->net; struct nft_table *table; struct nft_set *set; struct nlattr *attr; struct nft_ctx ctx; bool reset = false; - int rem, err = 0; table = nft_table_lookup(net, nla[NFTA_SET_ELEM_LIST_TABLE], family, genmask, 0); @@ -6004,8 +6024,13 @@ static int nf_tables_getsetelem(struct sk_buff *skb, NL_SET_BAD_ATTR(extack, attr); break; } + nelems++; } + if (reset) + audit_log_nft_set_reset(table, nft_pernet(net)->base_seq, + nelems); + return err; } From ea078ae9108e25fc881c84369f7c03931d22e555 Mon Sep 17 00:00:00 2001 From: Phil Sutter Date: Tue, 29 Aug 2023 19:51:58 +0200 Subject: [PATCH 16/95] netfilter: nf_tables: Audit log rule reset Resetting rules' stateful data happens outside of the transaction logic, so 'get' and 'dump' handlers have to emit audit log entries themselves. Fixes: 8daa8fde3fc3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET") Signed-off-by: Phil Sutter Reviewed-by: Richard Guy Briggs Signed-off-by: Pablo Neira Ayuso --- include/linux/audit.h | 1 + kernel/auditsc.c | 1 + net/netfilter/nf_tables_api.c | 18 ++++++++++++++++++ 3 files changed, 20 insertions(+) diff --git a/include/linux/audit.h b/include/linux/audit.h index 192bf03aacc5..51b1b7054a23 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -118,6 +118,7 @@ enum audit_nfcfgop { AUDIT_NFT_OP_FLOWTABLE_REGISTER, AUDIT_NFT_OP_FLOWTABLE_UNREGISTER, AUDIT_NFT_OP_SETELEM_RESET, + AUDIT_NFT_OP_RULE_RESET, AUDIT_NFT_OP_INVALID, }; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 38481e318197..fc0c7c03eeab 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -144,6 +144,7 @@ static const struct audit_nfcfgop_tab audit_nfcfgs[] = { { AUDIT_NFT_OP_FLOWTABLE_REGISTER, "nft_register_flowtable" }, { AUDIT_NFT_OP_FLOWTABLE_UNREGISTER, "nft_unregister_flowtable" }, { AUDIT_NFT_OP_SETELEM_RESET, "nft_reset_setelem" }, + { AUDIT_NFT_OP_RULE_RESET, "nft_reset_rule" }, { AUDIT_NFT_OP_INVALID, "nft_invalid" }, }; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 361e98e71692..2c81cee858d6 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3422,6 +3422,18 @@ err: nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS); } +static void audit_log_rule_reset(const struct nft_table *table, + unsigned int base_seq, + unsigned int nentries) +{ + char *buf = kasprintf(GFP_ATOMIC, "%s:%u", + table->name, base_seq); + + audit_log_nfcfg(buf, table->family, nentries, + AUDIT_NFT_OP_RULE_RESET, GFP_ATOMIC); + kfree(buf); +} + struct nft_rule_dump_ctx { char *table; char *chain; @@ -3528,6 +3540,9 @@ static int nf_tables_dump_rules(struct sk_buff *skb, done: rcu_read_unlock(); + if (reset && idx > cb->args[0]) + audit_log_rule_reset(table, cb->seq, idx - cb->args[0]); + cb->args[0] = idx; return skb->len; } @@ -3635,6 +3650,9 @@ static int nf_tables_getrule(struct sk_buff *skb, const struct nfnl_info *info, if (err < 0) goto err_fill_rule_info; + if (reset) + audit_log_rule_reset(table, nft_pernet(net)->base_seq, 1); + return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); err_fill_rule_info: From b5947239bfa666afd05ce0fc02b9c41ec8209e88 Mon Sep 17 00:00:00 2001 From: "Russell King (Oracle)" Date: Tue, 29 Aug 2023 14:29:50 +0100 Subject: [PATCH 17/95] net: stmmac: failure to probe without MAC interface specified Alexander Stein reports that commit a014c35556b9 ("net: stmmac: clarify difference between "interface" and "phy_interface"") caused breakage, because plat->mac_interface will never be negative. Fix this by using the "rc" temporary variable in stmmac_probe_config_dt(). Reported-by: Alexander Stein Signed-off-by: Russell King (Oracle) Tested-by: Alexander Stein Link: https://lore.kernel.org/r/E1qayn0-006Q8J-GE@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index 35f4b1484029..0f28795e581c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -419,9 +419,8 @@ stmmac_probe_config_dt(struct platform_device *pdev, u8 *mac) return ERR_PTR(phy_mode); plat->phy_interface = phy_mode; - plat->mac_interface = stmmac_of_get_mac_mode(np); - if (plat->mac_interface < 0) - plat->mac_interface = plat->phy_interface; + rc = stmmac_of_get_mac_mode(np); + plat->mac_interface = rc < 0 ? plat->phy_interface : rc; /* Some wrapper drivers still rely on phy_node. Let's save it while * they are not converted to phylink. */ From 8b72d2a1c6cc148320a93d029eb3a7e721f951f6 Mon Sep 17 00:00:00 2001 From: Oliver Neukum Date: Tue, 29 Aug 2023 10:47:17 +0200 Subject: [PATCH 18/95] NFC: nxp: add NXP1002 It is backwards compatible Signed-off-by: Oliver Neukum Reviewed-by: Krzysztof Kozlowski Link: https://lore.kernel.org/r/20230829084717.961-1-oneukum@suse.com Signed-off-by: Jakub Kicinski --- drivers/nfc/nxp-nci/i2c.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nfc/nxp-nci/i2c.c b/drivers/nfc/nxp-nci/i2c.c index dca25a0c2f33..3ae4b41c59ac 100644 --- a/drivers/nfc/nxp-nci/i2c.c +++ b/drivers/nfc/nxp-nci/i2c.c @@ -336,6 +336,7 @@ MODULE_DEVICE_TABLE(of, of_nxp_nci_i2c_match); #ifdef CONFIG_ACPI static const struct acpi_device_id acpi_id[] = { { "NXP1001" }, + { "NXP1002" }, { "NXP7471" }, { } }; From ee940b57a92965b76e05075e0a20f7d16a1cf976 Mon Sep 17 00:00:00 2001 From: Donald Hunter Date: Tue, 29 Aug 2023 09:55:39 +0100 Subject: [PATCH 19/95] doc/netlink: Fix missing classic_netlink doc reference Add missing cross-reference label for classic_netlink. Fixes: 2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions") Signed-off-by: Donald Hunter Link: https://lore.kernel.org/r/20230829085539.36354-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/userspace-api/netlink/intro.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/userspace-api/netlink/intro.rst b/Documentation/userspace-api/netlink/intro.rst index 0955e9f203d3..3ea70ad53c58 100644 --- a/Documentation/userspace-api/netlink/intro.rst +++ b/Documentation/userspace-api/netlink/intro.rst @@ -528,6 +528,8 @@ families may, however, require a larger buffer. 32kB buffer is recommended for most efficient handling of dumps (larger buffer fits more dumped objects and therefore fewer recvmsg() calls are needed). +.. _classic_netlink: + Classic Netlink =============== From 8c21ab1bae945686c602c5bfa4e3f3352c2452c5 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 29 Aug 2023 12:35:41 +0000 Subject: [PATCH 20/95] net/sched: fq_pie: avoid stalls in fq_pie_timer() When setting a high number of flows (limit being 65536), fq_pie_timer() is currently using too much time as syzbot reported. Add logic to yield the cpu every 2048 flows (less than 150 usec on debug kernels). It should also help by not blocking qdisc fast paths for too long. Worst case (65536 flows) would need 31 jiffies for a complete scan. Relevant extract from syzbot report: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-.... } 2663 jiffies s: 873 root: 0x1/. rcu: blocking rcu_node structures (internal RCU debug): Sending NMI from CPU 1 to CPUs 0: NMI backtrace for cpu 0 CPU: 0 PID: 5177 Comm: syz-executor273 Not tainted 6.5.0-syzkaller-00453-g727dbda16b83 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 RIP: 0010:check_kcov_mode kernel/kcov.c:173 [inline] RIP: 0010:write_comp_data+0x21/0x90 kernel/kcov.c:236 Code: 2e 0f 1f 84 00 00 00 00 00 65 8b 05 01 b2 7d 7e 49 89 f1 89 c6 49 89 d2 81 e6 00 01 00 00 49 89 f8 65 48 8b 14 25 80 b9 03 00 00 01 ff 00 74 0e 85 f6 74 59 8b 82 04 16 00 00 85 c0 74 4f 8b RSP: 0018:ffffc90000007bb8 EFLAGS: 00000206 RAX: 0000000000000101 RBX: ffffc9000dc0d140 RCX: ffffffff885893b0 RDX: ffff88807c075940 RSI: 0000000000000100 RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000dc0d178 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000555555d54380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b442f6130 CR3: 000000006fe1c000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: pie_calculate_probability+0x480/0x850 net/sched/sch_pie.c:415 fq_pie_timer+0x1da/0x4f0 net/sched/sch_fq_pie.c:387 call_timer_fn+0x1a0/0x580 kernel/time/timer.c:1700 Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Link: https://lore.kernel.org/lkml/00000000000017ad3f06040bf394@google.com/ Reported-by: syzbot+e46fbd5289363464bc13@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet Reviewed-by: Michal Kubiak Reviewed-by: Jamal Hadi Salim Link: https://lore.kernel.org/r/20230829123541.3745013-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/sched/sch_fq_pie.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c index 591d87d5e5c0..68e6acd0f130 100644 --- a/net/sched/sch_fq_pie.c +++ b/net/sched/sch_fq_pie.c @@ -61,6 +61,7 @@ struct fq_pie_sched_data { struct pie_params p_params; u32 ecn_prob; u32 flows_cnt; + u32 flows_cursor; u32 quantum; u32 memory_limit; u32 new_flow_count; @@ -375,22 +376,32 @@ flow_error: static void fq_pie_timer(struct timer_list *t) { struct fq_pie_sched_data *q = from_timer(q, t, adapt_timer); + unsigned long next, tupdate; struct Qdisc *sch = q->sch; spinlock_t *root_lock; /* to lock qdisc for probability calculations */ - u32 idx; + int max_cnt, i; rcu_read_lock(); root_lock = qdisc_lock(qdisc_root_sleeping(sch)); spin_lock(root_lock); - for (idx = 0; idx < q->flows_cnt; idx++) - pie_calculate_probability(&q->p_params, &q->flows[idx].vars, - q->flows[idx].backlog); - - /* reset the timer to fire after 'tupdate' jiffies. */ - if (q->p_params.tupdate) - mod_timer(&q->adapt_timer, jiffies + q->p_params.tupdate); + /* Limit this expensive loop to 2048 flows per round. */ + max_cnt = min_t(int, q->flows_cnt - q->flows_cursor, 2048); + for (i = 0; i < max_cnt; i++) { + pie_calculate_probability(&q->p_params, + &q->flows[q->flows_cursor].vars, + q->flows[q->flows_cursor].backlog); + q->flows_cursor++; + } + tupdate = q->p_params.tupdate; + next = 0; + if (q->flows_cursor >= q->flows_cnt) { + q->flows_cursor = 0; + next = tupdate; + } + if (tupdate) + mod_timer(&q->adapt_timer, jiffies + next); spin_unlock(root_lock); rcu_read_unlock(); } From dc9511dd6f37fe803f6b15b61b030728d7057417 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 30 Aug 2023 09:45:19 +0000 Subject: [PATCH 21/95] sctp: annotate data-races around sk->sk_wmem_queued sk->sk_wmem_queued can be read locklessly from sctp_poll() Use sk_wmem_queued_add() when the field is changed, and add READ_ONCE() annotations in sctp_writeable() and sctp_assocs_seq_show() syzbot reported: BUG: KCSAN: data-race in sctp_poll / sctp_wfree read-write to 0xffff888149d77810 of 4 bytes by interrupt on cpu 0: sctp_wfree+0x170/0x4a0 net/sctp/socket.c:9147 skb_release_head_state+0xb7/0x1a0 net/core/skbuff.c:988 skb_release_all net/core/skbuff.c:1000 [inline] __kfree_skb+0x16/0x140 net/core/skbuff.c:1016 consume_skb+0x57/0x180 net/core/skbuff.c:1232 sctp_chunk_destroy net/sctp/sm_make_chunk.c:1503 [inline] sctp_chunk_put+0xcd/0x130 net/sctp/sm_make_chunk.c:1530 sctp_datamsg_put+0x29a/0x300 net/sctp/chunk.c:128 sctp_chunk_free+0x34/0x50 net/sctp/sm_make_chunk.c:1515 sctp_outq_sack+0xafa/0xd70 net/sctp/outqueue.c:1381 sctp_cmd_process_sack net/sctp/sm_sideeffect.c:834 [inline] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1366 [inline] sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x12c7/0x31b0 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x2b2/0x430 net/sctp/associola.c:1051 sctp_inq_push+0x108/0x120 net/sctp/inqueue.c:80 sctp_rcv+0x116e/0x1340 net/sctp/input.c:243 sctp6_rcv+0x25/0x40 net/sctp/ipv6.c:1120 ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437 ip6_input_finish net/ipv6/ip6_input.c:482 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79 NF_HOOK include/linux/netfilter.h:303 [inline] ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core net/core/dev.c:5452 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566 process_backlog+0x21f/0x380 net/core/dev.c:5894 __napi_poll+0x60/0x3b0 net/core/dev.c:6460 napi_poll net/core/dev.c:6527 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6660 __do_softirq+0xc1/0x265 kernel/softirq.c:553 run_ksoftirqd+0x17/0x20 kernel/softirq.c:921 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:389 ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read to 0xffff888149d77810 of 4 bytes by task 17828 on cpu 1: sctp_writeable net/sctp/socket.c:9304 [inline] sctp_poll+0x265/0x410 net/sctp/socket.c:8671 sock_poll+0x253/0x270 net/socket.c:1374 vfs_poll include/linux/poll.h:88 [inline] do_pollfd fs/select.c:873 [inline] do_poll fs/select.c:921 [inline] do_sys_poll+0x636/0xc00 fs/select.c:1015 __do_sys_ppoll fs/select.c:1121 [inline] __se_sys_ppoll+0x1af/0x1f0 fs/select.c:1101 __x64_sys_ppoll+0x67/0x80 fs/select.c:1101 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00019e80 -> 0x0000cc80 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17828 Comm: syz-executor.1 Not tainted 6.5.0-rc7-syzkaller-00185-g28f20a19294d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot Signed-off-by: Eric Dumazet Cc: Marcelo Ricardo Leitner Acked-by: Xin Long Link: https://lore.kernel.org/r/20230830094519.950007-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/sctp/proc.c | 2 +- net/sctp/socket.c | 10 +++++----- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/net/sctp/proc.c b/net/sctp/proc.c index f13d6a34f32f..ec00ee75d59a 100644 --- a/net/sctp/proc.c +++ b/net/sctp/proc.c @@ -282,7 +282,7 @@ static int sctp_assocs_seq_show(struct seq_file *seq, void *v) assoc->init_retries, assoc->shutdown_retries, assoc->rtx_data_chunks, refcount_read(&sk->sk_wmem_alloc), - sk->sk_wmem_queued, + READ_ONCE(sk->sk_wmem_queued), sk->sk_sndbuf, sk->sk_rcvbuf); seq_printf(seq, "\n"); diff --git a/net/sctp/socket.c b/net/sctp/socket.c index fd0631e70d46..ab943e8fb1db 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -69,7 +69,7 @@ #include /* Forward declarations for internal helper functions. */ -static bool sctp_writeable(struct sock *sk); +static bool sctp_writeable(const struct sock *sk); static void sctp_wfree(struct sk_buff *skb); static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p, size_t msg_len); @@ -140,7 +140,7 @@ static inline void sctp_set_owner_w(struct sctp_chunk *chunk) refcount_add(sizeof(struct sctp_chunk), &sk->sk_wmem_alloc); asoc->sndbuf_used += chunk->skb->truesize + sizeof(struct sctp_chunk); - sk->sk_wmem_queued += chunk->skb->truesize + sizeof(struct sctp_chunk); + sk_wmem_queued_add(sk, chunk->skb->truesize + sizeof(struct sctp_chunk)); sk_mem_charge(sk, chunk->skb->truesize); } @@ -9144,7 +9144,7 @@ static void sctp_wfree(struct sk_buff *skb) struct sock *sk = asoc->base.sk; sk_mem_uncharge(sk, skb->truesize); - sk->sk_wmem_queued -= skb->truesize + sizeof(struct sctp_chunk); + sk_wmem_queued_add(sk, -(skb->truesize + sizeof(struct sctp_chunk))); asoc->sndbuf_used -= skb->truesize + sizeof(struct sctp_chunk); WARN_ON(refcount_sub_and_test(sizeof(struct sctp_chunk), &sk->sk_wmem_alloc)); @@ -9299,9 +9299,9 @@ void sctp_write_space(struct sock *sk) * UDP-style sockets or TCP-style sockets, this code should work. * - Daisy */ -static bool sctp_writeable(struct sock *sk) +static bool sctp_writeable(const struct sock *sk) { - return sk->sk_sndbuf > sk->sk_wmem_queued; + return READ_ONCE(sk->sk_sndbuf) > READ_ONCE(sk->sk_wmem_queued); } /* Wait for an association to go into ESTABLISHED state. If timeout is 0, From fce92af1c29d90184dfec638b5738831097d66e9 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 30 Aug 2023 09:55:20 +0000 Subject: [PATCH 22/95] ipv4: annotate data-races around fi->fib_dead syzbot complained about a data-race in fib_table_lookup() [1] Add appropriate annotations to document it. [1] BUG: KCSAN: data-race in fib_release_info / fib_table_lookup write to 0xffff888150f31744 of 1 bytes by task 1189 on cpu 0: fib_release_info+0x3a0/0x460 net/ipv4/fib_semantics.c:281 fib_table_delete+0x8d2/0x900 net/ipv4/fib_trie.c:1777 fib_magic+0x1c1/0x1f0 net/ipv4/fib_frontend.c:1106 fib_del_ifaddr+0x8cf/0xa60 net/ipv4/fib_frontend.c:1317 fib_inetaddr_event+0x77/0x200 net/ipv4/fib_frontend.c:1448 notifier_call_chain kernel/notifier.c:93 [inline] blocking_notifier_call_chain+0x90/0x200 kernel/notifier.c:388 __inet_del_ifa+0x4df/0x800 net/ipv4/devinet.c:432 inet_del_ifa net/ipv4/devinet.c:469 [inline] inetdev_destroy net/ipv4/devinet.c:322 [inline] inetdev_event+0x553/0xaf0 net/ipv4/devinet.c:1606 notifier_call_chain kernel/notifier.c:93 [inline] raw_notifier_call_chain+0x6b/0x1c0 kernel/notifier.c:461 call_netdevice_notifiers_info net/core/dev.c:1962 [inline] call_netdevice_notifiers_mtu+0xd2/0x130 net/core/dev.c:2037 dev_set_mtu_ext+0x30b/0x3e0 net/core/dev.c:8673 do_setlink+0x5be/0x2430 net/core/rtnetlink.c:2837 rtnl_setlink+0x255/0x300 net/core/rtnetlink.c:3177 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6445 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2549 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6463 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1914 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] sock_write_iter+0x1aa/0x230 net/socket.c:1129 do_iter_write+0x4b4/0x7b0 fs/read_write.c:860 vfs_writev+0x1a8/0x320 fs/read_write.c:933 do_writev+0xf8/0x220 fs/read_write.c:976 __do_sys_writev fs/read_write.c:1049 [inline] __se_sys_writev fs/read_write.c:1046 [inline] __x64_sys_writev+0x45/0x50 fs/read_write.c:1046 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888150f31744 of 1 bytes by task 21839 on cpu 1: fib_table_lookup+0x2bf/0xd50 net/ipv4/fib_trie.c:1585 fib_lookup include/net/ip_fib.h:383 [inline] ip_route_output_key_hash_rcu+0x38c/0x12c0 net/ipv4/route.c:2751 ip_route_output_key_hash net/ipv4/route.c:2641 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2869 send4+0x1e7/0x500 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work+0x434/0x860 kernel/workqueue.c:2600 worker_thread+0x5f2/0xa10 kernel/workqueue.c:2751 kthread+0x1d7/0x210 kernel/kthread.c:389 ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 21839 Comm: kworker/u4:18 Tainted: G W 6.5.0-syzkaller #0 Fixes: dccd9ecc3744 ("ipv4: Do not use dead fib_info entries.") Reported-by: syzbot Signed-off-by: Eric Dumazet Reviewed-by: David Ahern Link: https://lore.kernel.org/r/20230830095520.1046984-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/ipv4/fib_semantics.c | 5 ++++- net/ipv4/fib_trie.c | 3 ++- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 65ba18a91865..eafa4a033515 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -278,7 +278,8 @@ void fib_release_info(struct fib_info *fi) hlist_del(&nexthop_nh->nh_hash); } endfor_nexthops(fi) } - fi->fib_dead = 1; + /* Paired with READ_ONCE() from fib_table_lookup() */ + WRITE_ONCE(fi->fib_dead, 1); fib_info_put(fi); } spin_unlock_bh(&fib_info_lock); @@ -1581,6 +1582,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg, link_it: ofi = fib_find_info(fi); if (ofi) { + /* fib_table_lookup() should not see @fi yet. */ fi->fib_dead = 1; free_fib_info(fi); refcount_inc(&ofi->fib_treeref); @@ -1619,6 +1621,7 @@ err_inval: failure: if (fi) { + /* fib_table_lookup() should not see @fi yet. */ fi->fib_dead = 1; free_fib_info(fi); } diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 74d403dbd2b4..d13fb9e76b97 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1582,7 +1582,8 @@ found: if (fa->fa_dscp && inet_dscp_to_dsfield(fa->fa_dscp) != flp->flowi4_tos) continue; - if (fi->fib_dead) + /* Paired with WRITE_ONCE() in fib_release_info() */ + if (READ_ONCE(fi->fib_dead)) continue; if (fa->fa_info->fib_scope < flp->flowi4_scope) continue; From a3e0fdf71bbe031de845e8e08ed7fba49f9c702c Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 30 Aug 2023 10:12:44 +0000 Subject: [PATCH 23/95] net: read sk->sk_family once in sk_mc_loop() syzbot is playing with IPV6_ADDRFORM quite a lot these days, and managed to hit the WARN_ON_ONCE(1) in sk_mc_loop() We have many more similar issues to fix. WARNING: CPU: 1 PID: 1593 at net/core/sock.c:782 sk_mc_loop+0x165/0x260 Modules linked in: CPU: 1 PID: 1593 Comm: kworker/1:3 Not tainted 6.1.40-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Workqueue: events_power_efficient gc_worker RIP: 0010:sk_mc_loop+0x165/0x260 net/core/sock.c:782 Code: 34 1b fd 49 81 c7 18 05 00 00 4c 89 f8 48 c1 e8 03 42 80 3c 20 00 74 08 4c 89 ff e8 25 36 6d fd 4d 8b 37 eb 13 e8 db 33 1b fd <0f> 0b b3 01 eb 34 e8 d0 33 1b fd 45 31 f6 49 83 c6 38 4c 89 f0 48 RSP: 0018:ffffc90000388530 EFLAGS: 00010246 RAX: ffffffff846d9b55 RBX: 0000000000000011 RCX: ffff88814f884980 RDX: 0000000000000102 RSI: ffffffff87ae5160 RDI: 0000000000000011 RBP: ffffc90000388550 R08: 0000000000000003 R09: ffffffff846d9a65 R10: 0000000000000002 R11: ffff88814f884980 R12: dffffc0000000000 R13: ffff88810dbee000 R14: 0000000000000010 R15: ffff888150084000 FS: 0000000000000000(0000) GS:ffff8881f6b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000180 CR3: 000000014ee5b000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: [] ip6_finish_output2+0x33f/0x1ae0 net/ipv6/ip6_output.c:83 [] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline] [] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211 [] NF_HOOK_COND include/linux/netfilter.h:298 [inline] [] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232 [] dst_output include/net/dst.h:444 [inline] [] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161 [] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline] [] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline] [] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline] [] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677 [] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229 [] netdev_start_xmit include/linux/netdevice.h:4925 [inline] [] xmit_one net/core/dev.c:3644 [inline] [] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660 [] sch_direct_xmit+0x2a0/0x9c0 net/sched/sch_generic.c:342 [] qdisc_restart net/sched/sch_generic.c:407 [inline] [] __qdisc_run+0xb13/0x1e70 net/sched/sch_generic.c:415 [] qdisc_run+0xd6/0x260 include/net/pkt_sched.h:125 [] net_tx_action+0x7ac/0x940 net/core/dev.c:5247 [] __do_softirq+0x2bd/0x9bd kernel/softirq.c:599 [] invoke_softirq kernel/softirq.c:430 [inline] [] __irq_exit_rcu+0xc8/0x170 kernel/softirq.c:683 [] irq_exit_rcu+0x9/0x20 kernel/softirq.c:695 Fixes: 7ad6848c7e81 ("ip: fix mc_loop checks for tunnels with multicast outer addresses") Reported-by: syzbot Signed-off-by: Eric Dumazet Reviewed-by: Kuniyuki Iwashima Link: https://lore.kernel.org/r/20230830101244.1146934-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/core/sock.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index 666a17cab4f5..b0dd501dabd6 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -765,7 +765,8 @@ bool sk_mc_loop(struct sock *sk) return false; if (!sk) return true; - switch (sk->sk_family) { + /* IPV6_ADDRFORM can change sk->sk_family under us. */ + switch (READ_ONCE(sk->sk_family)) { case AF_INET: return inet_test_bit(MC_LOOP, sk); #if IS_ENABLED(CONFIG_IPV6) From 8aae7625ff3f0bd5484d01f1b8d5af82e44bec2d Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Wed, 30 Aug 2023 13:00:37 +0200 Subject: [PATCH 24/95] net: fib: avoid warn splat in flow dissector New skbs allocated via nf_send_reset() have skb->dev == NULL. fib*_rules_early_flow_dissect helpers already have a 'struct net' argument but its not passed down to the flow dissector core, which will then WARN as it can't derive a net namespace to use: WARNING: CPU: 0 PID: 0 at net/core/flow_dissector.c:1016 __skb_flow_dissect+0xa91/0x1cd0 [..] ip_route_me_harder+0x143/0x330 nf_send_reset+0x17c/0x2d0 [nf_reject_ipv4] nft_reject_inet_eval+0xa9/0xf2 [nft_reject_inet] nft_do_chain+0x198/0x5d0 [nf_tables] nft_do_chain_inet+0xa4/0x110 [nf_tables] nf_hook_slow+0x41/0xc0 ip_local_deliver+0xce/0x110 .. Cc: Stanislav Fomichev Cc: David Ahern Cc: Ido Schimmel Fixes: 812fa71f0d96 ("netfilter: Dissect flow after packet mangling") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217826 Signed-off-by: Florian Westphal Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Link: https://lore.kernel.org/r/20230830110043.30497-1-fw@strlen.de Signed-off-by: Paolo Abeni --- include/net/ip6_fib.h | 5 ++++- include/net/ip_fib.h | 5 ++++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h index c9ff23cf313e..1ba9f4ddf2f6 100644 --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -642,7 +642,10 @@ static inline bool fib6_rules_early_flow_dissect(struct net *net, if (!net->ipv6.fib6_rules_require_fldissect) return false; - skb_flow_dissect_flow_keys(skb, flkeys, flag); + memset(flkeys, 0, sizeof(*flkeys)); + __skb_flow_dissect(net, skb, &flow_keys_dissector, + flkeys, NULL, 0, 0, 0, flag); + fl6->fl6_sport = flkeys->ports.src; fl6->fl6_dport = flkeys->ports.dst; fl6->flowi6_proto = flkeys->basic.ip_proto; diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index a378eff827c7..f0c13864180e 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -418,7 +418,10 @@ static inline bool fib4_rules_early_flow_dissect(struct net *net, if (!net->ipv4.fib_rules_require_fldissect) return false; - skb_flow_dissect_flow_keys(skb, flkeys, flag); + memset(flkeys, 0, sizeof(*flkeys)); + __skb_flow_dissect(net, skb, &flow_keys_dissector, + flkeys, NULL, 0, 0, 0, flag); + fl4->fl4_sport = flkeys->ports.src; fl4->fl4_dport = flkeys->ports.dst; fl4->flowi4_proto = flkeys->basic.ip_proto; From 3e019d8a05a38abb5c85d4f1e85fda964610aa14 Mon Sep 17 00:00:00 2001 From: Magnus Karlsson Date: Thu, 31 Aug 2023 12:01:17 +0200 Subject: [PATCH 25/95] xsk: Fix xsk_diag use-after-free error during socket cleanup Fix a use-after-free error that is possible if the xsk_diag interface is used after the socket has been unbound from the device. This can happen either due to the socket being closed or the device disappearing. In the early days of AF_XDP, the way we tested that a socket was not bound to a device was to simply check if the netdevice pointer in the xsk socket structure was NULL. Later, a better system was introduced by having an explicit state variable in the xsk socket struct. For example, the state of a socket that is on the way to being closed and has been unbound from the device is XSK_UNBOUND. The commit in the Fixes tag below deleted the old way of signalling that a socket is unbound, setting dev to NULL. This in the belief that all code using the old way had been exterminated. That was unfortunately not true as the xsk diagnostics code was still using the old way and thus does not work as intended when a socket is going down. Fix this by introducing a test against the state variable. If the socket is in the state XSK_UNBOUND, simply abort the diagnostic's netlink operation. Fixes: 18b1ab7aa76b ("xsk: Fix race at socket teardown") Reported-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson Signed-off-by: Daniel Borkmann Tested-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com Tested-by: Maciej Fijalkowski Reviewed-by: Maciej Fijalkowski Link: https://lore.kernel.org/bpf/20230831100119.17408-1-magnus.karlsson@gmail.com --- net/xdp/xsk_diag.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c index c014217f5fa7..22b36c8143cf 100644 --- a/net/xdp/xsk_diag.c +++ b/net/xdp/xsk_diag.c @@ -111,6 +111,9 @@ static int xsk_diag_fill(struct sock *sk, struct sk_buff *nlskb, sock_diag_save_cookie(sk, msg->xdiag_cookie); mutex_lock(&xs->mutex); + if (READ_ONCE(xs->state) == XSK_UNBOUND) + goto out_nlmsg_trim; + if ((req->xdiag_show & XDP_SHOW_INFO) && xsk_diag_put_info(xs, nlskb)) goto out_nlmsg_trim; From 121fd33bf2d99007f8fe2a155c291a30baca3f52 Mon Sep 17 00:00:00 2001 From: Vishal Chourasia Date: Tue, 29 Aug 2023 13:19:31 +0530 Subject: [PATCH 26/95] bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py The script bpf_doc.py generates multiple SyntaxWarnings related to invalid escape sequences when executed with Python 3.12. These warnings do not appear in Python 3.10 and 3.11 and do not affect the kernel build, which completes successfully. This patch resolves these SyntaxWarnings by converting the relevant string literals to raw strings or by escaping backslashes. This ensures that backslashes are interpreted as literal characters, eliminating the warnings. Reported-by: Srikar Dronamraju Signed-off-by: Vishal Chourasia Signed-off-by: Daniel Borkmann Tested-by: Quentin Monnet Link: https://lore.kernel.org/bpf/20230829074931.2511204-1-vishalc@linux.ibm.com --- scripts/bpf_doc.py | 56 +++++++++++++++++++++++----------------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index eaae2ce78381..61b7dddedc46 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -59,9 +59,9 @@ class Helper(APIElement): Break down helper function protocol into smaller chunks: return type, name, distincts arguments. """ - arg_re = re.compile('((\w+ )*?(\w+|...))( (\**)(\w+))?$') + arg_re = re.compile(r'((\w+ )*?(\w+|...))( (\**)(\w+))?$') res = {} - proto_re = re.compile('(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$') + proto_re = re.compile(r'(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$') capture = proto_re.match(self.proto) res['ret_type'] = capture.group(1) @@ -114,11 +114,11 @@ class HeaderParser(object): return Helper(proto=proto, desc=desc, ret=ret) def parse_symbol(self): - p = re.compile(' \* ?(BPF\w+)$') + p = re.compile(r' \* ?(BPF\w+)$') capture = p.match(self.line) if not capture: raise NoSyscallCommandFound - end_re = re.compile(' \* ?NOTES$') + end_re = re.compile(r' \* ?NOTES$') end = end_re.match(self.line) if end: raise NoSyscallCommandFound @@ -133,7 +133,7 @@ class HeaderParser(object): # - Same as above, with "const" and/or "struct" in front of type # - "..." (undefined number of arguments, for bpf_trace_printk()) # There is at least one term ("void"), and at most five arguments. - p = re.compile(' \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$') + p = re.compile(r' \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$') capture = p.match(self.line) if not capture: raise NoHelperFound @@ -141,7 +141,7 @@ class HeaderParser(object): return capture.group(1) def parse_desc(self, proto): - p = re.compile(' \* ?(?:\t| {5,8})Description$') + p = re.compile(r' \* ?(?:\t| {5,8})Description$') capture = p.match(self.line) if not capture: raise Exception("No description section found for " + proto) @@ -154,7 +154,7 @@ class HeaderParser(object): if self.line == ' *\n': desc += '\n' else: - p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)') + p = re.compile(r' \* ?(?:\t| {5,8})(?:\t| {8})(.*)') capture = p.match(self.line) if capture: desc_present = True @@ -167,7 +167,7 @@ class HeaderParser(object): return desc def parse_ret(self, proto): - p = re.compile(' \* ?(?:\t| {5,8})Return$') + p = re.compile(r' \* ?(?:\t| {5,8})Return$') capture = p.match(self.line) if not capture: raise Exception("No return section found for " + proto) @@ -180,7 +180,7 @@ class HeaderParser(object): if self.line == ' *\n': ret += '\n' else: - p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)') + p = re.compile(r' \* ?(?:\t| {5,8})(?:\t| {8})(.*)') capture = p.match(self.line) if capture: ret_present = True @@ -219,12 +219,12 @@ class HeaderParser(object): self.seek_to('enum bpf_cmd {', 'Could not find start of bpf_cmd enum', 0) # Searches for either one or more BPF\w+ enums - bpf_p = re.compile('\s*(BPF\w+)+') + bpf_p = re.compile(r'\s*(BPF\w+)+') # Searches for an enum entry assigned to another entry, # for e.g. BPF_PROG_RUN = BPF_PROG_TEST_RUN, which is # not documented hence should be skipped in check to # determine if the right number of syscalls are documented - assign_p = re.compile('\s*(BPF\w+)\s*=\s*(BPF\w+)') + assign_p = re.compile(r'\s*(BPF\w+)\s*=\s*(BPF\w+)') bpf_cmd_str = '' while True: capture = assign_p.match(self.line) @@ -239,7 +239,7 @@ class HeaderParser(object): break self.line = self.reader.readline() # Find the number of occurences of BPF\w+ - self.enum_syscalls = re.findall('(BPF\w+)+', bpf_cmd_str) + self.enum_syscalls = re.findall(r'(BPF\w+)+', bpf_cmd_str) def parse_desc_helpers(self): self.seek_to(helpersDocStart, @@ -263,7 +263,7 @@ class HeaderParser(object): self.seek_to('#define ___BPF_FUNC_MAPPER(FN, ctx...)', 'Could not find start of eBPF helper definition list') # Searches for one FN(\w+) define or a backslash for newline - p = re.compile('\s*FN\((\w+), (\d+), ##ctx\)|\\\\') + p = re.compile(r'\s*FN\((\w+), (\d+), ##ctx\)|\\\\') fn_defines_str = '' i = 0 while True: @@ -278,7 +278,7 @@ class HeaderParser(object): break self.line = self.reader.readline() # Find the number of occurences of FN(\w+) - self.define_unique_helpers = re.findall('FN\(\w+, \d+, ##ctx\)', fn_defines_str) + self.define_unique_helpers = re.findall(r'FN\(\w+, \d+, ##ctx\)', fn_defines_str) def validate_helpers(self): last_helper = '' @@ -425,7 +425,7 @@ class PrinterRST(Printer): try: cmd = ['git', 'log', '-1', '--pretty=format:%cs', '--no-patch', '-L', - '/{}/,/\*\//:include/uapi/linux/bpf.h'.format(delimiter)] + '/{}/,/\\*\\//:include/uapi/linux/bpf.h'.format(delimiter)] date = subprocess.run(cmd, cwd=linuxRoot, capture_output=True, check=True) return date.stdout.decode().rstrip() @@ -516,7 +516,7 @@ as "Dual BSD/GPL", may be used). Some helper functions are only accessible to programs that are compatible with the GNU Privacy License (GPL). In order to use such helpers, the eBPF program must be loaded with the correct -license string passed (via **attr**) to the **bpf**\ () system call, and this +license string passed (via **attr**) to the **bpf**\\ () system call, and this generally translates into the C source code of the program containing a line similar to the following: @@ -550,7 +550,7 @@ may be interested in: * The bpftool utility can be used to probe the availability of helper functions on the system (as well as supported program and map types, and a number of other parameters). To do so, run **bpftool feature probe** (see - **bpftool-feature**\ (8) for details). Add the **unprivileged** keyword to + **bpftool-feature**\\ (8) for details). Add the **unprivileged** keyword to list features available to unprivileged users. Compatibility between helper functions and program types can generally be found @@ -562,23 +562,23 @@ other functions, themselves allowing access to additional helpers. The requirement for GPL license is also in those **struct bpf_func_proto**. Compatibility between helper functions and map types can be found in the -**check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*. +**check_map_func_compatibility**\\ () function in file *kernel/bpf/verifier.c*. Helper functions that invalidate the checks on **data** and **data_end** pointers for network processing are listed in function -**bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*. +**bpf_helper_changes_pkt_data**\\ () in file *net/core/filter.c*. SEE ALSO ======== -**bpf**\ (2), -**bpftool**\ (8), -**cgroups**\ (7), -**ip**\ (8), -**perf_event_open**\ (2), -**sendmsg**\ (2), -**socket**\ (7), -**tc-bpf**\ (8)''' +**bpf**\\ (2), +**bpftool**\\ (8), +**cgroups**\\ (7), +**ip**\\ (8), +**perf_event_open**\\ (2), +**sendmsg**\\ (2), +**socket**\\ (7), +**tc-bpf**\\ (8)''' print(footer) def print_proto(self, helper): @@ -598,7 +598,7 @@ SEE ALSO one_arg = '{}{}'.format(comma, a['type']) if a['name']: if a['star']: - one_arg += ' {}**\ '.format(a['star'].replace('*', '\\*')) + one_arg += ' {}**\\ '.format(a['star'].replace('*', '\\*')) else: one_arg += '** ' one_arg += '*{}*\\ **'.format(a['name']) From d11ae1b16b0a57fac524cad8e277a20ec62600d1 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 31 Aug 2023 16:11:03 +0200 Subject: [PATCH 27/95] selftests/bpf: Fix d_path test Recent commit [1] broke d_path test, because now filp_close is not called directly from sys_close, but eventually later when the file is finally released. As suggested by Hou Tao we don't need to re-hook the bpf program, but just instead we can use sys_close_range to trigger filp_close synchronously. [1] 021a160abf62 ("fs: use __fput_sync in close(2)") Suggested-by: Hou Tao Signed-off-by: Jiri Olsa Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230831141103.359810-1-jolsa@kernel.org --- .../testing/selftests/bpf/prog_tests/d_path.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/d_path.c b/tools/testing/selftests/bpf/prog_tests/d_path.c index 911345c526e6..ccc768592e66 100644 --- a/tools/testing/selftests/bpf/prog_tests/d_path.c +++ b/tools/testing/selftests/bpf/prog_tests/d_path.c @@ -12,6 +12,17 @@ #include "test_d_path_check_rdonly_mem.skel.h" #include "test_d_path_check_types.skel.h" +/* sys_close_range is not around for long time, so let's + * make sure we can call it on systems with older glibc + */ +#ifndef __NR_close_range +#ifdef __alpha__ +#define __NR_close_range 546 +#else +#define __NR_close_range 436 +#endif +#endif + static int duration; static struct { @@ -90,7 +101,11 @@ static int trigger_fstat_events(pid_t pid) fstat(indicatorfd, &fileStat); out_close: - /* triggers filp_close */ + /* sys_close no longer triggers filp_close, but we can + * call sys_close_range instead which still does + */ +#define close(fd) syscall(__NR_close_range, fd, fd, 0) + close(pipefd[0]); close(pipefd[1]); close(sockfd); @@ -98,6 +113,8 @@ out_close: close(devfd); close(localfd); close(indicatorfd); + +#undef close return ret; } From 6a86b5b5cd76d2734304a0173f5f01aa8aa2025e Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 29 Aug 2023 22:53:52 +0200 Subject: [PATCH 28/95] bpf: Annotate bpf_long_memcpy with data_race syzbot reported a data race splat between two processes trying to update the same BPF map value via syscall on different CPUs: BUG: KCSAN: data-race in bpf_percpu_array_update / bpf_percpu_array_update write to 0xffffe8fffe7425d8 of 8 bytes by task 8257 on cpu 1: bpf_long_memcpy include/linux/bpf.h:428 [inline] bpf_obj_memcpy include/linux/bpf.h:441 [inline] copy_map_value_long include/linux/bpf.h:464 [inline] bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380 bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749 bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648 __sys_bpf+0x28a/0x780 __do_sys_bpf kernel/bpf/syscall.c:5241 [inline] __se_sys_bpf kernel/bpf/syscall.c:5239 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffffe8fffe7425d8 of 8 bytes by task 8268 on cpu 0: bpf_long_memcpy include/linux/bpf.h:428 [inline] bpf_obj_memcpy include/linux/bpf.h:441 [inline] copy_map_value_long include/linux/bpf.h:464 [inline] bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380 bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749 bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648 __sys_bpf+0x28a/0x780 __do_sys_bpf kernel/bpf/syscall.c:5241 [inline] __se_sys_bpf kernel/bpf/syscall.c:5239 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xfffffff000002788 The bpf_long_memcpy is used with 8-byte aligned pointers, power-of-8 size and forced to use long read/writes to try to atomically copy long counters. It is best-effort only and no barriers are here since it _will_ race with concurrent updates from BPF programs. The bpf_long_memcpy() is called from bpf(2) syscall. Marco suggested that the best way to make this known to KCSAN would be to use data_race() annotation. Reported-by: syzbot+97522333291430dd277f@syzkaller.appspotmail.com Suggested-by: Marco Elver Signed-off-by: Daniel Borkmann Acked-by: Marco Elver Link: https://lore.kernel.org/bpf/000000000000d87a7f06040c970c@google.com Link: https://lore.kernel.org/bpf/57628f7a15e20d502247c3b55fceb1cb2b31f266.1693342186.git.daniel@iogearbox.net --- include/linux/bpf.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 12596af59c00..024e8b28c34b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -438,7 +438,7 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size) size /= sizeof(long); while (size--) - *ldst++ = *lsrc++; + data_race(*ldst++ = *lsrc++); } /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */ From be8e754cbfac698d6304bb8382c8d18ac74424d3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= Date: Thu, 31 Aug 2023 18:29:54 +0200 Subject: [PATCH 29/95] selftests/bpf: Include build flavors for install target MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When using the "install" or targets depending on install, e.g. "gen_tar", the BPF machine flavors weren't included. A command like: | make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- O=/workspace/kbuild \ | HOSTCC=gcc FORMAT= SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" \ | -C tools/testing/selftests gen_tar would not include bpf/no_alu32, bpf/cpuv4, or bpf/bpf-gcc. Include the BPF machine flavors for "install" make target. Signed-off-by: Björn Töpel Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230831162954.111485-1-bjorn@kernel.org --- tools/testing/selftests/bpf/Makefile | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index edef49fcd23e..caede9b574cb 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -50,14 +50,17 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_cgroup_storage \ test_tcpnotify_user test_sysctl \ test_progs-no_alu32 +TEST_INST_SUBDIRS := no_alu32 # Also test bpf-gcc, if present ifneq ($(BPF_GCC),) TEST_GEN_PROGS += test_progs-bpf_gcc +TEST_INST_SUBDIRS += bpf_gcc endif ifneq ($(CLANG_CPUV4),) TEST_GEN_PROGS += test_progs-cpuv4 +TEST_INST_SUBDIRS += cpuv4 endif TEST_GEN_FILES = test_lwt_ip_encap.bpf.o test_tc_edt.bpf.o @@ -714,3 +717,12 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \ # Delete partially updated (corrupted) files on error .DELETE_ON_ERROR: + +DEFAULT_INSTALL_RULE := $(INSTALL_RULE) +override define INSTALL_RULE + $(DEFAULT_INSTALL_RULE) + @for DIR in $(TEST_INST_SUBDIRS); do \ + mkdir -p $(INSTALL_PATH)/$$DIR; \ + rsync -a $(OUTPUT)/$$DIR/*.bpf.o $(INSTALL_PATH)/$$DIR;\ + done +endef From 82ba0ff7bf0483d962e592017bef659ae022d754 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 08:45:09 +0000 Subject: [PATCH 30/95] net/handshake: fix null-ptr-deref in handshake_nl_done_doit() We should not call trace_handshake_cmd_done_err() if socket lookup has failed. Also we should call trace_handshake_cmd_done_err() before releasing the file, otherwise dereferencing sock->sk can return garbage. This also reverts 7afc6d0a107f ("net/handshake: Fix uninitialized local variable") Unable to handle kernel paging request at virtual address dfff800000000003 KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000003] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 5986 Comm: syz-executor292 Not tainted 6.5.0-rc7-syzkaller-gfe4469582053 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : handshake_nl_done_doit+0x198/0x9c8 net/handshake/netlink.c:193 lr : handshake_nl_done_doit+0x180/0x9c8 sp : ffff800096e37180 x29: ffff800096e37200 x28: 1ffff00012dc6e34 x27: dfff800000000000 x26: ffff800096e373d0 x25: 0000000000000000 x24: 00000000ffffffa8 x23: ffff800096e373f0 x22: 1ffff00012dc6e38 x21: 0000000000000000 x20: ffff800096e371c0 x19: 0000000000000018 x18: 0000000000000000 x17: 0000000000000000 x16: ffff800080516cc4 x15: 0000000000000001 x14: 1fffe0001b14aa3b x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000003 x8 : 0000000000000003 x7 : ffff800080afe47c x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800080a88078 x2 : 0000000000000001 x1 : 00000000ffffffa8 x0 : 0000000000000000 Call trace: handshake_nl_done_doit+0x198/0x9c8 net/handshake/netlink.c:193 genl_family_rcv_msg_doit net/netlink/genetlink.c:970 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1050 [inline] genl_rcv_msg+0x96c/0xc50 net/netlink/genetlink.c:1067 netlink_rcv_skb+0x214/0x3c4 net/netlink/af_netlink.c:2549 genl_rcv+0x38/0x50 net/netlink/genetlink.c:1078 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x660/0x8d4 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x834/0xb18 net/netlink/af_netlink.c:1914 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x56c/0x840 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmsg+0x26c/0x33c net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __arm64_sys_sendmsg+0x80/0x94 net/socket.c:2584 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x58/0x16c arch/arm64/kernel/entry-common.c:678 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591 Code: 12800108 b90043e8 910062b3 d343fe68 (387b6908) Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reported-by: syzbot Signed-off-by: Eric Dumazet Cc: Chuck Lever Reviewed-by: Michal Kubiak Signed-off-by: David S. Miller --- net/handshake/netlink.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c index 1086653e1fad..d0bc1dd8e65a 100644 --- a/net/handshake/netlink.c +++ b/net/handshake/netlink.c @@ -157,26 +157,24 @@ out_status: int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *info) { struct net *net = sock_net(skb->sk); - struct handshake_req *req = NULL; - struct socket *sock = NULL; + struct handshake_req *req; + struct socket *sock; int fd, status, err; if (GENL_REQ_ATTR_CHECK(info, HANDSHAKE_A_DONE_SOCKFD)) return -EINVAL; fd = nla_get_u32(info->attrs[HANDSHAKE_A_DONE_SOCKFD]); - err = 0; sock = sockfd_lookup(fd, &err); - if (err) { - err = -EBADF; - goto out_status; - } + if (!sock) + return err; req = handshake_req_hash_lookup(sock->sk); if (!req) { err = -EBUSY; + trace_handshake_cmd_done_err(net, req, sock->sk, err); fput(sock->file); - goto out_status; + return err; } trace_handshake_cmd_done(net, req, sock->sk, fd); @@ -188,10 +186,6 @@ int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *info) handshake_complete(req, status, info); fput(sock->file); return 0; - -out_status: - trace_handshake_cmd_done_err(net, req, sock->sk, err); - return err; } static unsigned int handshake_net_id; From 66d58f046c9d3a8f996b7138d02e965fd0617de0 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 13:52:08 +0000 Subject: [PATCH 31/95] net: use sk_forward_alloc_get() in sk_get_meminfo() inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(), but sk_get_meminfo() was forgotten. Fixes: 292e6077b040 ("net: introduce sk_forward_alloc_get()") Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/sock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index b0dd501dabd6..a61ec97098ad 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3743,7 +3743,7 @@ void sk_get_meminfo(const struct sock *sk, u32 *mem) mem[SK_MEMINFO_RCVBUF] = READ_ONCE(sk->sk_rcvbuf); mem[SK_MEMINFO_WMEM_ALLOC] = sk_wmem_alloc_get(sk); mem[SK_MEMINFO_SNDBUF] = READ_ONCE(sk->sk_sndbuf); - mem[SK_MEMINFO_FWD_ALLOC] = sk->sk_forward_alloc; + mem[SK_MEMINFO_FWD_ALLOC] = sk_forward_alloc_get(sk); mem[SK_MEMINFO_WMEM_QUEUED] = READ_ONCE(sk->sk_wmem_queued); mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc); mem[SK_MEMINFO_BACKLOG] = READ_ONCE(sk->sk_backlog.len); From 5e6300e7b3a4ab5b72a82079753868e91fbf9efc Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 13:52:09 +0000 Subject: [PATCH 32/95] net: annotate data-races around sk->sk_forward_alloc Every time sk->sk_forward_alloc is read locklessly, add a READ_ONCE(). Add sk_forward_alloc_add() helper to centralize updates, to reduce number of WRITE_ONCE(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- include/net/sock.h | 12 +++++++++--- net/core/sock.c | 8 ++++---- net/ipv4/tcp_output.c | 2 +- net/ipv4/udp.c | 6 +++--- net/mptcp/protocol.c | 6 +++--- 5 files changed, 20 insertions(+), 14 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 11d503417591..f04869ac1d92 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1053,6 +1053,12 @@ static inline void sk_wmem_queued_add(struct sock *sk, int val) WRITE_ONCE(sk->sk_wmem_queued, sk->sk_wmem_queued + val); } +static inline void sk_forward_alloc_add(struct sock *sk, int val) +{ + /* Paired with lockless reads of sk->sk_forward_alloc */ + WRITE_ONCE(sk->sk_forward_alloc, sk->sk_forward_alloc + val); +} + void sk_stream_write_space(struct sock *sk); /* OOB backlog add */ @@ -1377,7 +1383,7 @@ static inline int sk_forward_alloc_get(const struct sock *sk) if (sk->sk_prot->forward_alloc_get) return sk->sk_prot->forward_alloc_get(sk); #endif - return sk->sk_forward_alloc; + return READ_ONCE(sk->sk_forward_alloc); } static inline bool __sk_stream_memory_free(const struct sock *sk, int wake) @@ -1673,14 +1679,14 @@ static inline void sk_mem_charge(struct sock *sk, int size) { if (!sk_has_account(sk)) return; - sk->sk_forward_alloc -= size; + sk_forward_alloc_add(sk, -size); } static inline void sk_mem_uncharge(struct sock *sk, int size) { if (!sk_has_account(sk)) return; - sk->sk_forward_alloc += size; + sk_forward_alloc_add(sk, size); sk_mem_reclaim(sk); } diff --git a/net/core/sock.c b/net/core/sock.c index a61ec97098ad..40e1bda4bde0 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1045,7 +1045,7 @@ static int sock_reserve_memory(struct sock *sk, int bytes) mem_cgroup_uncharge_skmem(sk->sk_memcg, pages); return -ENOMEM; } - sk->sk_forward_alloc += pages << PAGE_SHIFT; + sk_forward_alloc_add(sk, pages << PAGE_SHIFT); WRITE_ONCE(sk->sk_reserved_mem, sk->sk_reserved_mem + (pages << PAGE_SHIFT)); @@ -3139,10 +3139,10 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind) { int ret, amt = sk_mem_pages(size); - sk->sk_forward_alloc += amt << PAGE_SHIFT; + sk_forward_alloc_add(sk, amt << PAGE_SHIFT); ret = __sk_mem_raise_allocated(sk, size, amt, kind); if (!ret) - sk->sk_forward_alloc -= amt << PAGE_SHIFT; + sk_forward_alloc_add(sk, -(amt << PAGE_SHIFT)); return ret; } EXPORT_SYMBOL(__sk_mem_schedule); @@ -3174,7 +3174,7 @@ void __sk_mem_reduce_allocated(struct sock *sk, int amount) void __sk_mem_reclaim(struct sock *sk, int amount) { amount >>= PAGE_SHIFT; - sk->sk_forward_alloc -= amount << PAGE_SHIFT; + sk_forward_alloc_add(sk, -(amount << PAGE_SHIFT)); __sk_mem_reduce_allocated(sk, amount); } EXPORT_SYMBOL(__sk_mem_reclaim); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e6b4fbd642f7..ccfc8bbf7455 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3474,7 +3474,7 @@ void sk_forced_mem_schedule(struct sock *sk, int size) if (delta <= 0) return; amt = sk_mem_pages(delta); - sk->sk_forward_alloc += amt << PAGE_SHIFT; + sk_forward_alloc_add(sk, amt << PAGE_SHIFT); sk_memory_allocated_add(sk, amt); if (mem_cgroup_sockets_enabled && sk->sk_memcg) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 0794a2c46a56..f39b9c844580 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1414,9 +1414,9 @@ static void udp_rmem_release(struct sock *sk, int size, int partial, spin_lock(&sk_queue->lock); - sk->sk_forward_alloc += size; + sk_forward_alloc_add(sk, size); amt = (sk->sk_forward_alloc - partial) & ~(PAGE_SIZE - 1); - sk->sk_forward_alloc -= amt; + sk_forward_alloc_add(sk, -amt); if (amt) __sk_mem_reduce_allocated(sk, amt >> PAGE_SHIFT); @@ -1527,7 +1527,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) goto uncharge_drop; } - sk->sk_forward_alloc -= size; + sk_forward_alloc_add(sk, -size); /* no need to setup a destructor, we will explicitly release the * forward allocated memory on dequeue diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 933b257eee02..625df3a36c46 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1800,7 +1800,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) } /* data successfully copied into the write queue */ - sk->sk_forward_alloc -= total_ts; + sk_forward_alloc_add(sk, -total_ts); copied += psize; dfrag->data_len += psize; frag_truesize += psize; @@ -3257,7 +3257,7 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags) /* move all the rx fwd alloc into the sk_mem_reclaim_final in * inet_sock_destruct() will dispose it */ - sk->sk_forward_alloc += msk->rmem_fwd_alloc; + sk_forward_alloc_add(sk, msk->rmem_fwd_alloc); msk->rmem_fwd_alloc = 0; mptcp_token_destroy(msk); mptcp_pm_free_anno_list(msk); @@ -3522,7 +3522,7 @@ static void mptcp_shutdown(struct sock *sk, int how) static int mptcp_forward_alloc_get(const struct sock *sk) { - return sk->sk_forward_alloc + mptcp_sk(sk)->rmem_fwd_alloc; + return READ_ONCE(sk->sk_forward_alloc) + mptcp_sk(sk)->rmem_fwd_alloc; } static int mptcp_ioctl_outq(const struct mptcp_sock *msk, u64 v) From 9531e4a83febc3fb47ac77e24cfb5ea97e50034d Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 13:52:10 +0000 Subject: [PATCH 33/95] mptcp: annotate data-races around msk->rmem_fwd_alloc msk->rmem_fwd_alloc can be read locklessly. Add mptcp_rmem_fwd_alloc_add(), similar to sk_forward_alloc_add(), and appropriate READ_ONCE()/WRITE_ONCE() annotations. Fixes: 6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path") Signed-off-by: Eric Dumazet Cc: Paolo Abeni Signed-off-by: David S. Miller --- net/mptcp/protocol.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 625df3a36c46..a7fc16f5175d 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -134,9 +134,15 @@ static void mptcp_drop(struct sock *sk, struct sk_buff *skb) __kfree_skb(skb); } +static void mptcp_rmem_fwd_alloc_add(struct sock *sk, int size) +{ + WRITE_ONCE(mptcp_sk(sk)->rmem_fwd_alloc, + mptcp_sk(sk)->rmem_fwd_alloc + size); +} + static void mptcp_rmem_charge(struct sock *sk, int size) { - mptcp_sk(sk)->rmem_fwd_alloc -= size; + mptcp_rmem_fwd_alloc_add(sk, -size); } static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to, @@ -177,7 +183,7 @@ static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to, static void __mptcp_rmem_reclaim(struct sock *sk, int amount) { amount >>= PAGE_SHIFT; - mptcp_sk(sk)->rmem_fwd_alloc -= amount << PAGE_SHIFT; + mptcp_rmem_charge(sk, amount << PAGE_SHIFT); __sk_mem_reduce_allocated(sk, amount); } @@ -186,7 +192,7 @@ static void mptcp_rmem_uncharge(struct sock *sk, int size) struct mptcp_sock *msk = mptcp_sk(sk); int reclaimable; - msk->rmem_fwd_alloc += size; + mptcp_rmem_fwd_alloc_add(sk, size); reclaimable = msk->rmem_fwd_alloc - sk_unused_reserved_mem(sk); /* see sk_mem_uncharge() for the rationale behind the following schema */ @@ -341,7 +347,7 @@ static bool mptcp_rmem_schedule(struct sock *sk, struct sock *ssk, int size) if (!__sk_mem_raise_allocated(sk, size, amt, SK_MEM_RECV)) return false; - msk->rmem_fwd_alloc += amount; + mptcp_rmem_fwd_alloc_add(sk, amount); return true; } @@ -3258,7 +3264,7 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags) * inet_sock_destruct() will dispose it */ sk_forward_alloc_add(sk, msk->rmem_fwd_alloc); - msk->rmem_fwd_alloc = 0; + WRITE_ONCE(msk->rmem_fwd_alloc, 0); mptcp_token_destroy(msk); mptcp_pm_free_anno_list(msk); mptcp_free_local_addr_list(msk); @@ -3522,7 +3528,8 @@ static void mptcp_shutdown(struct sock *sk, int how) static int mptcp_forward_alloc_get(const struct sock *sk) { - return READ_ONCE(sk->sk_forward_alloc) + mptcp_sk(sk)->rmem_fwd_alloc; + return READ_ONCE(sk->sk_forward_alloc) + + READ_ONCE(mptcp_sk(sk)->rmem_fwd_alloc); } static int mptcp_ioctl_outq(const struct mptcp_sock *msk, u64 v) From e3390b30a5dfb112e8e802a59c0f68f947b638b2 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 13:52:11 +0000 Subject: [PATCH 34/95] net: annotate data-races around sk->sk_tsflags sk->sk_tsflags can be read locklessly, add corresponding annotations. Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Signed-off-by: David S. Miller --- include/net/ip.h | 2 +- include/net/sock.h | 17 ++++++++++------- net/can/j1939/socket.c | 10 ++++++---- net/core/skbuff.c | 10 ++++++---- net/core/sock.c | 4 ++-- net/ipv4/ip_output.c | 2 +- net/ipv4/ip_sockglue.c | 2 +- net/ipv4/tcp.c | 4 ++-- net/ipv6/ip6_output.c | 2 +- net/ipv6/ping.c | 2 +- net/ipv6/raw.c | 2 +- net/ipv6/udp.c | 2 +- net/socket.c | 13 +++++++------ 13 files changed, 40 insertions(+), 32 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 19adacd5ece0..9276cea775cc 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -94,7 +94,7 @@ static inline void ipcm_init_sk(struct ipcm_cookie *ipcm, ipcm_init(ipcm); ipcm->sockc.mark = READ_ONCE(inet->sk.sk_mark); - ipcm->sockc.tsflags = inet->sk.sk_tsflags; + ipcm->sockc.tsflags = READ_ONCE(inet->sk.sk_tsflags); ipcm->oif = READ_ONCE(inet->sk.sk_bound_dev_if); ipcm->addr = inet->inet_saddr; ipcm->protocol = inet->inet_num; diff --git a/include/net/sock.h b/include/net/sock.h index f04869ac1d92..b770261fbdaf 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1906,7 +1906,9 @@ struct sockcm_cookie { static inline void sockcm_init(struct sockcm_cookie *sockc, const struct sock *sk) { - *sockc = (struct sockcm_cookie) { .tsflags = sk->sk_tsflags }; + *sockc = (struct sockcm_cookie) { + .tsflags = READ_ONCE(sk->sk_tsflags) + }; } int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, @@ -2701,9 +2703,9 @@ void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk, static inline void sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb) { - ktime_t kt = skb->tstamp; struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb); - + u32 tsflags = READ_ONCE(sk->sk_tsflags); + ktime_t kt = skb->tstamp; /* * generate control messages if * - receive time stamping in software requested @@ -2711,10 +2713,10 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb) * - hardware time stamps available and wanted */ if (sock_flag(sk, SOCK_RCVTSTAMP) || - (sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) || - (kt && sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) || + (tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) || + (kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) || (hwtstamps->hwtstamp && - (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE))) + (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE))) __sock_recv_timestamp(msg, sk, skb); else sock_write_timestamp(sk, kt); @@ -2736,7 +2738,8 @@ static inline void sock_recv_cmsgs(struct msghdr *msg, struct sock *sk, #define TSFLAGS_ANY (SOF_TIMESTAMPING_SOFTWARE | \ SOF_TIMESTAMPING_RAW_HARDWARE) - if (sk->sk_flags & FLAGS_RECV_CMSGS || sk->sk_tsflags & TSFLAGS_ANY) + if (sk->sk_flags & FLAGS_RECV_CMSGS || + READ_ONCE(sk->sk_tsflags) & TSFLAGS_ANY) __sock_recv_cmsgs(msg, sk, skb); else if (unlikely(sock_flag(sk, SOCK_TIMESTAMP))) sock_write_timestamp(sk, skb->tstamp); diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c index feaec4ad6d16..b28c976f52a0 100644 --- a/net/can/j1939/socket.c +++ b/net/can/j1939/socket.c @@ -974,6 +974,7 @@ static void __j1939_sk_errqueue(struct j1939_session *session, struct sock *sk, struct sock_exterr_skb *serr; struct sk_buff *skb; char *state = "UNK"; + u32 tsflags; int err; jsk = j1939_sk(sk); @@ -981,13 +982,14 @@ static void __j1939_sk_errqueue(struct j1939_session *session, struct sock *sk, if (!(jsk->state & J1939_SOCK_ERRQUEUE)) return; + tsflags = READ_ONCE(sk->sk_tsflags); switch (type) { case J1939_ERRQUEUE_TX_ACK: - if (!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)) + if (!(tsflags & SOF_TIMESTAMPING_TX_ACK)) return; break; case J1939_ERRQUEUE_TX_SCHED: - if (!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_SCHED)) + if (!(tsflags & SOF_TIMESTAMPING_TX_SCHED)) return; break; case J1939_ERRQUEUE_TX_ABORT: @@ -997,7 +999,7 @@ static void __j1939_sk_errqueue(struct j1939_session *session, struct sock *sk, case J1939_ERRQUEUE_RX_DPO: fallthrough; case J1939_ERRQUEUE_RX_ABORT: - if (!(sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE)) + if (!(tsflags & SOF_TIMESTAMPING_RX_SOFTWARE)) return; break; default: @@ -1054,7 +1056,7 @@ static void __j1939_sk_errqueue(struct j1939_session *session, struct sock *sk, } serr->opt_stats = true; - if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) + if (tsflags & SOF_TIMESTAMPING_OPT_ID) serr->ee.ee_data = session->tskey; netdev_dbg(session->priv->ndev, "%s: 0x%p tskey: %i, state: %s\n", diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 45707059082f..24f26e816184 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5207,7 +5207,7 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb, serr->ee.ee_info = tstype; serr->opt_stats = opt_stats; serr->header.h4.iif = skb->dev ? skb->dev->ifindex : 0; - if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) { + if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) { serr->ee.ee_data = skb_shinfo(skb)->tskey; if (sk_is_tcp(sk)) serr->ee.ee_data -= atomic_read(&sk->sk_tskey); @@ -5263,21 +5263,23 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb, { struct sk_buff *skb; bool tsonly, opt_stats = false; + u32 tsflags; if (!sk) return; - if (!hwtstamps && !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) && + tsflags = READ_ONCE(sk->sk_tsflags); + if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) && skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS) return; - tsonly = sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY; + tsonly = tsflags & SOF_TIMESTAMPING_OPT_TSONLY; if (!skb_may_tx_timestamp(sk, tsonly)) return; if (tsonly) { #ifdef CONFIG_INET - if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) && + if ((tsflags & SOF_TIMESTAMPING_OPT_STATS) && sk_is_tcp(sk)) { skb = tcp_get_timestamping_opt_stats(sk, orig_skb, ack_skb); diff --git a/net/core/sock.c b/net/core/sock.c index 40e1bda4bde0..d05a290300b6 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -937,7 +937,7 @@ int sock_set_timestamping(struct sock *sk, int optname, return ret; } - sk->sk_tsflags = val; + WRITE_ONCE(sk->sk_tsflags, val); sock_valbool_flag(sk, SOCK_TSTAMP_NEW, optname == SO_TIMESTAMPING_NEW); if (val & SOF_TIMESTAMPING_RX_SOFTWARE) @@ -1719,7 +1719,7 @@ int sk_getsockopt(struct sock *sk, int level, int optname, case SO_TIMESTAMPING_OLD: lv = sizeof(v.timestamping); - v.timestamping.flags = sk->sk_tsflags; + v.timestamping.flags = READ_ONCE(sk->sk_tsflags); v.timestamping.bind_phc = sk->sk_bind_phc; break; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index b2e0ad312028..4ab877cf6d35 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -981,7 +981,7 @@ static int __ip_append_data(struct sock *sk, paged = !!cork->gso_size; if (cork->tx_flags & SKBTX_ANY_TSTAMP && - sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) + READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) tskey = atomic_inc_return(&sk->sk_tskey) - 1; hh_len = LL_RESERVED_SPACE(rt->dst.dev); diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index d1c73660b844..cce9cb25f3b3 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -511,7 +511,7 @@ static bool ipv4_datagram_support_cmsg(const struct sock *sk, * or without payload (SOF_TIMESTAMPING_OPT_TSONLY). */ info = PKTINFO_SKB_CB(skb); - if (!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG) || + if (!(READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_CMSG) || !info->ipi_ifindex) return false; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index cee1e548660c..cc4b250262c1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2259,14 +2259,14 @@ void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk, } } - if (sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) + if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_SOFTWARE) has_timestamping = true; else tss->ts[0] = (struct timespec64) {0}; } if (tss->ts[2].tv_sec || tss->ts[2].tv_nsec) { - if (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) + if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_RAW_HARDWARE) has_timestamping = true; else tss->ts[2] = (struct timespec64) {0}; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 4ab50169a5a9..54fc4c711f2c 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1501,7 +1501,7 @@ static int __ip6_append_data(struct sock *sk, orig_mtu = mtu; if (cork->tx_flags & SKBTX_ANY_TSTAMP && - sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) + READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) tskey = atomic_inc_return(&sk->sk_tskey) - 1; hh_len = LL_RESERVED_SPACE(rt->dst.dev); diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index 1b2772834972..5831aaa53d75 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -119,7 +119,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) return -EINVAL; ipcm6_init_sk(&ipc6, np); - ipc6.sockc.tsflags = sk->sk_tsflags; + ipc6.sockc.tsflags = READ_ONCE(sk->sk_tsflags); ipc6.sockc.mark = READ_ONCE(sk->sk_mark); fl6.flowi6_oif = oif; diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index 0eae7661a85c..42fcec3ecf5e 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -772,7 +772,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) fl6.flowi6_uid = sk->sk_uid; ipcm6_init(&ipc6); - ipc6.sockc.tsflags = sk->sk_tsflags; + ipc6.sockc.tsflags = READ_ONCE(sk->sk_tsflags); ipc6.sockc.mark = fl6.flowi6_mark; if (sin6) { diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index ebc6ae47cfea..86b5d509a468 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1339,7 +1339,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) ipcm6_init(&ipc6); ipc6.gso_size = READ_ONCE(up->gso_size); - ipc6.sockc.tsflags = sk->sk_tsflags; + ipc6.sockc.tsflags = READ_ONCE(sk->sk_tsflags); ipc6.sockc.mark = READ_ONCE(sk->sk_mark); /* destination address check */ diff --git a/net/socket.c b/net/socket.c index 848116d06b51..98ffffab949e 100644 --- a/net/socket.c +++ b/net/socket.c @@ -825,7 +825,7 @@ static bool skb_is_swtx_tstamp(const struct sk_buff *skb, int false_tstamp) static ktime_t get_timestamp(struct sock *sk, struct sk_buff *skb, int *if_index) { - bool cycles = sk->sk_tsflags & SOF_TIMESTAMPING_BIND_PHC; + bool cycles = READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_BIND_PHC; struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb); struct net_device *orig_dev; ktime_t hwtstamp; @@ -877,12 +877,12 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk, int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP); int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW); struct scm_timestamping_internal tss; - int empty = 1, false_tstamp = 0; struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb); int if_index; ktime_t hwtstamp; + u32 tsflags; /* Race occurred between timestamp enabling and packet receiving. Fill in the current time for now. */ @@ -924,11 +924,12 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk, } memset(&tss, 0, sizeof(tss)); - if ((sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) && + tsflags = READ_ONCE(sk->sk_tsflags); + if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) && ktime_to_timespec64_cond(skb->tstamp, tss.ts + 0)) empty = 0; if (shhwtstamps && - (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) && + (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) && !skb_is_swtx_tstamp(skb, false_tstamp)) { if_index = 0; if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NETDEV) @@ -936,14 +937,14 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk, else hwtstamp = shhwtstamps->hwtstamp; - if (sk->sk_tsflags & SOF_TIMESTAMPING_BIND_PHC) + if (tsflags & SOF_TIMESTAMPING_BIND_PHC) hwtstamp = ptp_convert_timestamp(&hwtstamp, sk->sk_bind_phc); if (ktime_to_timespec64_cond(hwtstamp, tss.ts + 2)) { empty = 0; - if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_PKTINFO) && + if ((tsflags & SOF_TIMESTAMPING_OPT_PKTINFO) && !skb_is_err_queue(skb)) put_ts_pktinfo(msg, skb, if_index); } From 251cd405a9e6e70b92fe5afbdd17fd5caf9d3266 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 13:52:12 +0000 Subject: [PATCH 35/95] net: annotate data-races around sk->sk_bind_phc sk->sk_bind_phc is read locklessly. Add corresponding annotations. Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding") Signed-off-by: Eric Dumazet Cc: Yangbo Lu Signed-off-by: David S. Miller --- net/core/sock.c | 4 ++-- net/socket.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index d05a290300b6..d3c7b53368d2 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -894,7 +894,7 @@ static int sock_timestamping_bind_phc(struct sock *sk, int phc_index) if (!match) return -EINVAL; - sk->sk_bind_phc = phc_index; + WRITE_ONCE(sk->sk_bind_phc, phc_index); return 0; } @@ -1720,7 +1720,7 @@ int sk_getsockopt(struct sock *sk, int level, int optname, case SO_TIMESTAMPING_OLD: lv = sizeof(v.timestamping); v.timestamping.flags = READ_ONCE(sk->sk_tsflags); - v.timestamping.bind_phc = sk->sk_bind_phc; + v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc); break; case SO_RCVTIMEO_OLD: diff --git a/net/socket.c b/net/socket.c index 98ffffab949e..928b05811cfd 100644 --- a/net/socket.c +++ b/net/socket.c @@ -939,7 +939,7 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk, if (tsflags & SOF_TIMESTAMPING_BIND_PHC) hwtstamp = ptp_convert_timestamp(&hwtstamp, - sk->sk_bind_phc); + READ_ONCE(sk->sk_bind_phc)); if (ktime_to_timespec64_cond(hwtstamp, tss.ts + 2)) { empty = 0; From 2ea35288c83b3d501a88bc17f2df8f176b5cc96f Mon Sep 17 00:00:00 2001 From: Mohamed Khalfella Date: Thu, 31 Aug 2023 02:17:02 -0600 Subject: [PATCH 36/95] skbuff: skb_segment, Call zero copy functions before using skbuff frags Commit bf5c25d60861 ("skbuff: in skb_segment, call zerocopy functions once per nskb") added the call to zero copy functions in skb_segment(). The change introduced a bug in skb_segment() because skb_orphan_frags() may possibly change the number of fragments or allocate new fragments altogether leaving nrfrags and frag to point to the old values. This can cause a panic with stacktrace like the one below. [ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc [ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26 [ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0 [ 194.021892] Call Trace: [ 194.027422] [ 194.072861] tcp_gso_segment+0x107/0x540 [ 194.082031] inet_gso_segment+0x15c/0x3d0 [ 194.090783] skb_mac_gso_segment+0x9f/0x110 [ 194.095016] __skb_gso_segment+0xc1/0x190 [ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem] [ 194.107071] dev_qdisc_enqueue+0x16/0x70 [ 194.110884] __dev_queue_xmit+0x63b/0xb30 [ 194.121670] bond_start_xmit+0x159/0x380 [bonding] [ 194.128506] dev_hard_start_xmit+0xc3/0x1e0 [ 194.131787] __dev_queue_xmit+0x8a0/0xb30 [ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan] [ 194.141477] dev_hard_start_xmit+0xc3/0x1e0 [ 194.144622] sch_direct_xmit+0xe3/0x280 [ 194.147748] __dev_queue_xmit+0x54a/0xb30 [ 194.154131] tap_get_user+0x2a8/0x9c0 [tap] [ 194.157358] tap_sendmsg+0x52/0x8e0 [tap] [ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net] [ 194.173631] handle_tx+0xcd/0xe0 [vhost_net] [ 194.176959] vhost_worker+0x76/0xb0 [vhost] [ 194.183667] kthread+0x118/0x140 [ 194.190358] ret_from_fork+0x1f/0x30 [ 194.193670] In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags local variable in skb_segment() stale. This resulted in the code hitting i >= nrfrags prematurely and trying to move to next frag_skb using list_skb pointer, which was NULL, and caused kernel panic. Move the call to zero copy functions before using frags and nr_frags. Fixes: bf5c25d60861 ("skbuff: in skb_segment, call zerocopy functions once per nskb") Signed-off-by: Mohamed Khalfella Reported-by: Amit Goyal Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/skbuff.c | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 24f26e816184..17caf4ea67da 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4423,21 +4423,20 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, struct sk_buff *segs = NULL; struct sk_buff *tail = NULL; struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list; - skb_frag_t *frag = skb_shinfo(head_skb)->frags; unsigned int mss = skb_shinfo(head_skb)->gso_size; unsigned int doffset = head_skb->data - skb_mac_header(head_skb); - struct sk_buff *frag_skb = head_skb; unsigned int offset = doffset; unsigned int tnl_hlen = skb_tnl_header_len(head_skb); unsigned int partial_segs = 0; unsigned int headroom; unsigned int len = head_skb->len; + struct sk_buff *frag_skb; + skb_frag_t *frag; __be16 proto; bool csum, sg; - int nfrags = skb_shinfo(head_skb)->nr_frags; int err = -ENOMEM; int i = 0; - int pos; + int nfrags, pos; if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) && mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) { @@ -4514,6 +4513,13 @@ normal: headroom = skb_headroom(head_skb); pos = skb_headlen(head_skb); + if (skb_orphan_frags(head_skb, GFP_ATOMIC)) + return ERR_PTR(-ENOMEM); + + nfrags = skb_shinfo(head_skb)->nr_frags; + frag = skb_shinfo(head_skb)->frags; + frag_skb = head_skb; + do { struct sk_buff *nskb; skb_frag_t *nskb_frag; @@ -4534,6 +4540,10 @@ normal: (skb_headlen(list_skb) == len || sg)) { BUG_ON(skb_headlen(list_skb) > len); + nskb = skb_clone(list_skb, GFP_ATOMIC); + if (unlikely(!nskb)) + goto err; + i = 0; nfrags = skb_shinfo(list_skb)->nr_frags; frag = skb_shinfo(list_skb)->frags; @@ -4552,12 +4562,8 @@ normal: frag++; } - nskb = skb_clone(list_skb, GFP_ATOMIC); list_skb = list_skb->next; - if (unlikely(!nskb)) - goto err; - if (unlikely(pskb_trim(nskb, len))) { kfree_skb(nskb); goto err; @@ -4633,12 +4639,16 @@ normal: skb_shinfo(nskb)->flags |= skb_shinfo(head_skb)->flags & SKBFL_SHARED_FRAG; - if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || - skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) + if (skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) goto err; while (pos < offset + len) { if (i >= nfrags) { + if (skb_orphan_frags(list_skb, GFP_ATOMIC) || + skb_zerocopy_clone(nskb, list_skb, + GFP_ATOMIC)) + goto err; + i = 0; nfrags = skb_shinfo(list_skb)->nr_frags; frag = skb_shinfo(list_skb)->frags; @@ -4652,10 +4662,6 @@ normal: i--; frag--; } - if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || - skb_zerocopy_clone(nskb, frag_skb, - GFP_ATOMIC)) - goto err; list_skb = list_skb->next; } From 6ac66cb03ae306c2e288a9be18226310529f5b25 Mon Sep 17 00:00:00 2001 From: Sriram Yagnaraman Date: Thu, 31 Aug 2023 10:03:30 +0200 Subject: [PATCH 37/95] ipv4: ignore dst hint for multipath routes Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group. A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in ip_mkroute_input() and use it in ip_extract_route_hint() to check for the existence of the flag. Fixes: 02b24941619f ("ipv4: use dst hint for ipv4 list receive") Signed-off-by: Sriram Yagnaraman Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Signed-off-by: David S. Miller --- include/net/ip.h | 1 + net/ipv4/ip_input.c | 3 ++- net/ipv4/route.c | 1 + 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/include/net/ip.h b/include/net/ip.h index 9276cea775cc..3489a1cca5e7 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -57,6 +57,7 @@ struct inet_skb_parm { #define IPSKB_FRAG_PMTU BIT(6) #define IPSKB_L3SLAVE BIT(7) #define IPSKB_NOPOLICY BIT(8) +#define IPSKB_MULTIPATH BIT(9) u16 frag_max_size; }; diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index fe9ead9ee863..5e9c8156656a 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -584,7 +584,8 @@ static void ip_sublist_rcv_finish(struct list_head *head) static struct sk_buff *ip_extract_route_hint(const struct net *net, struct sk_buff *skb, int rt_type) { - if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST) + if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST || + IPCB(skb)->flags & IPSKB_MULTIPATH) return NULL; return skb; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index a4e153dd615b..6a3f57a3fa41 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2144,6 +2144,7 @@ static int ip_mkroute_input(struct sk_buff *skb, int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys); fib_select_multipath(res, h); + IPCB(skb)->flags |= IPSKB_MULTIPATH; } #endif From 8423be8926aa82cd2e28bba5cc96ccb72c7ce6be Mon Sep 17 00:00:00 2001 From: Sriram Yagnaraman Date: Thu, 31 Aug 2023 10:03:31 +0200 Subject: [PATCH 38/95] ipv6: ignore dst hint for multipath routes Route hints when the nexthop is part of a multipath group causes packets in the same receive batch to be sent to the same nexthop irrespective of the multipath hash of the packet. So, do not extract route hint for packets whose destination is part of a multipath group. A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the flag when route is looked up in fib6_select_path() and use it in ip6_can_use_hint() to check for the existence of the flag. Fixes: 197dbf24e360 ("ipv6: introduce and uses route look hints for list input.") Signed-off-by: Sriram Yagnaraman Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Signed-off-by: David S. Miller --- include/linux/ipv6.h | 1 + net/ipv6/ip6_input.c | 3 ++- net/ipv6/route.c | 3 +++ 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 5883551b1ee8..af8a771a053c 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -147,6 +147,7 @@ struct inet6_skb_parm { #define IP6SKB_JUMBOGRAM 128 #define IP6SKB_SEG6 256 #define IP6SKB_FAKEJUMBO 512 +#define IP6SKB_MULTIPATH 1024 }; #if defined(CONFIG_NET_L3_MASTER_DEV) diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index d94041bb4287..b8378814532c 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -99,7 +99,8 @@ static bool ip6_can_use_hint(const struct sk_buff *skb, static struct sk_buff *ip6_extract_route_hint(const struct net *net, struct sk_buff *skb) { - if (fib6_routes_require_src(net) || fib6_has_custom_rules(net)) + if (fib6_routes_require_src(net) || fib6_has_custom_rules(net) || + IP6CB(skb)->flags & IP6SKB_MULTIPATH) return NULL; return skb; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 846aec8e0093..01d6d352850a 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -423,6 +423,9 @@ void fib6_select_path(const struct net *net, struct fib6_result *res, if (match->nh && have_oif_match && res->nh) return; + if (skb) + IP6CB(skb)->flags |= IP6SKB_MULTIPATH; + /* We might have already computed the hash for ICMPv6 errors. In such * case it will always be non-zero. Otherwise now is the time to do it. */ From 8ae9efb859c05a54ac92b3336c6ca0597c9c8cdb Mon Sep 17 00:00:00 2001 From: Sriram Yagnaraman Date: Thu, 31 Aug 2023 10:03:32 +0200 Subject: [PATCH 39/95] selftests: fib_tests: Add multipath list receive tests The test uses perf stat to count the number of fib:fib_table_lookup tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for IPv6. The measured count is checked to be within 5% of the total number of packets sent via veth1. Signed-off-by: Sriram Yagnaraman Signed-off-by: David S. Miller --- tools/testing/selftests/net/fib_tests.sh | 155 ++++++++++++++++++++++- 1 file changed, 154 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index d328af4a149c..e7d2a530618a 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -12,7 +12,8 @@ ksft_skip=4 TESTS="unregister down carrier nexthop suppress ipv6_notify ipv4_notify \ ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics \ ipv4_route_metrics ipv4_route_v6_gw rp_filter ipv4_del_addr \ - ipv6_del_addr ipv4_mangle ipv6_mangle ipv4_bcast_neigh fib6_gc_test" + ipv6_del_addr ipv4_mangle ipv6_mangle ipv4_bcast_neigh fib6_gc_test \ + ipv4_mpath_list ipv6_mpath_list" VERBOSE=0 PAUSE_ON_FAIL=no @@ -2352,6 +2353,156 @@ ipv4_bcast_neigh_test() cleanup } +mpath_dep_check() +{ + if [ ! -x "$(command -v mausezahn)" ]; then + echo "mausezahn command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v jq)" ]; then + echo "jq command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v bc)" ]; then + echo "bc command not found. Skipping test" + return 1 + fi + + if [ ! -x "$(command -v perf)" ]; then + echo "perf command not found. Skipping test" + return 1 + fi + + perf list fib:* | grep -q fib_table_lookup + if [ $? -ne 0 ]; then + echo "IPv4 FIB tracepoint not found. Skipping test" + return 1 + fi + + perf list fib6:* | grep -q fib6_table_lookup + if [ $? -ne 0 ]; then + echo "IPv6 FIB tracepoint not found. Skipping test" + return 1 + fi + + return 0 +} + +link_stats_get() +{ + local ns=$1; shift + local dev=$1; shift + local dir=$1; shift + local stat=$1; shift + + ip -n $ns -j -s link show dev $dev \ + | jq '.[]["stats64"]["'$dir'"]["'$stat'"]' +} + +list_rcv_eval() +{ + local file=$1; shift + local expected=$1; shift + + local count=$(tail -n 1 $file | jq '.["counter-value"] | tonumber | floor') + local ratio=$(echo "scale=2; $count / $expected" | bc -l) + local res=$(echo "$ratio >= 0.95" | bc) + [[ $res -eq 1 ]] + log_test $? 0 "Multipath route hit ratio ($ratio)" +} + +ipv4_mpath_list_test() +{ + echo + echo "IPv4 multipath list receive tests" + + mpath_dep_check || return 1 + + route_setup + + set -e + run_cmd "ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off" + + run_cmd "ip netns exec ns2 bash -c \"echo 20000 > /sys/class/net/veth2/gro_flush_timeout\"" + run_cmd "ip netns exec ns2 bash -c \"echo 1 > /sys/class/net/veth2/napi_defer_hard_irqs\"" + run_cmd "ip netns exec ns2 ethtool -K veth2 generic-receive-offload on" + run_cmd "ip -n ns2 link add name nh1 up type dummy" + run_cmd "ip -n ns2 link add name nh2 up type dummy" + run_cmd "ip -n ns2 address add 172.16.201.1/24 dev nh1" + run_cmd "ip -n ns2 address add 172.16.202.1/24 dev nh2" + run_cmd "ip -n ns2 neigh add 172.16.201.2 lladdr 00:11:22:33:44:55 nud perm dev nh1" + run_cmd "ip -n ns2 neigh add 172.16.202.2 lladdr 00:aa:bb:cc:dd:ee nud perm dev nh2" + run_cmd "ip -n ns2 route add 203.0.113.0/24 + nexthop via 172.16.201.2 nexthop via 172.16.202.2" + run_cmd "ip netns exec ns2 sysctl -qw net.ipv4.fib_multipath_hash_policy=1" + set +e + + local dmac=$(ip -n ns2 -j link show dev veth2 | jq -r '.[]["address"]') + local tmp_file=$(mktemp) + local cmd="ip netns exec ns1 mausezahn veth1 -a own -b $dmac + -A 172.16.101.1 -B 203.0.113.1 -t udp 'sp=12345,dp=0-65535' -q" + + # Packets forwarded in a list using a multipath route must not reuse a + # cached result so that a flow always hits the same nexthop. In other + # words, the FIB lookup tracepoint needs to be triggered for every + # packet. + local t0_rx_pkts=$(link_stats_get ns2 veth2 rx packets) + run_cmd "perf stat -e fib:fib_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + local t1_rx_pkts=$(link_stats_get ns2 veth2 rx packets) + local diff=$(echo $t1_rx_pkts - $t0_rx_pkts | bc -l) + list_rcv_eval $tmp_file $diff + + rm $tmp_file + route_cleanup +} + +ipv6_mpath_list_test() +{ + echo + echo "IPv6 multipath list receive tests" + + mpath_dep_check || return 1 + + route_setup + + set -e + run_cmd "ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off" + + run_cmd "ip netns exec ns2 bash -c \"echo 20000 > /sys/class/net/veth2/gro_flush_timeout\"" + run_cmd "ip netns exec ns2 bash -c \"echo 1 > /sys/class/net/veth2/napi_defer_hard_irqs\"" + run_cmd "ip netns exec ns2 ethtool -K veth2 generic-receive-offload on" + run_cmd "ip -n ns2 link add name nh1 up type dummy" + run_cmd "ip -n ns2 link add name nh2 up type dummy" + run_cmd "ip -n ns2 -6 address add 2001:db8:201::1/64 dev nh1" + run_cmd "ip -n ns2 -6 address add 2001:db8:202::1/64 dev nh2" + run_cmd "ip -n ns2 -6 neigh add 2001:db8:201::2 lladdr 00:11:22:33:44:55 nud perm dev nh1" + run_cmd "ip -n ns2 -6 neigh add 2001:db8:202::2 lladdr 00:aa:bb:cc:dd:ee nud perm dev nh2" + run_cmd "ip -n ns2 -6 route add 2001:db8:301::/64 + nexthop via 2001:db8:201::2 nexthop via 2001:db8:202::2" + run_cmd "ip netns exec ns2 sysctl -qw net.ipv6.fib_multipath_hash_policy=1" + set +e + + local dmac=$(ip -n ns2 -j link show dev veth2 | jq -r '.[]["address"]') + local tmp_file=$(mktemp) + local cmd="ip netns exec ns1 mausezahn -6 veth1 -a own -b $dmac + -A 2001:db8:101::1 -B 2001:db8:301::1 -t udp 'sp=12345,dp=0-65535' -q" + + # Packets forwarded in a list using a multipath route must not reuse a + # cached result so that a flow always hits the same nexthop. In other + # words, the FIB lookup tracepoint needs to be triggered for every + # packet. + local t0_rx_pkts=$(link_stats_get ns2 veth2 rx packets) + run_cmd "perf stat -e fib6:fib6_table_lookup --filter 'err == 0' -j -o $tmp_file -- $cmd" + local t1_rx_pkts=$(link_stats_get ns2 veth2 rx packets) + local diff=$(echo $t1_rx_pkts - $t0_rx_pkts | bc -l) + list_rcv_eval $tmp_file $diff + + rm $tmp_file + route_cleanup +} + ################################################################################ # usage @@ -2433,6 +2584,8 @@ do ipv6_mangle) ipv6_mangle_test;; ipv4_bcast_neigh) ipv4_bcast_neigh_test;; fib6_gc_test|ipv6_gc) fib6_gc_test;; + ipv4_mpath_list) ipv4_mpath_list_test;; + ipv6_mpath_list) ipv6_mpath_list_test;; help) echo "Test names: $TESTS"; exit 0;; esac From ae074e2b2fd410bf54d56509a7e48fb83873af3b Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Thu, 31 Aug 2023 17:58:11 +0100 Subject: [PATCH 40/95] sfc: check for zero length in EF10 RX prefix When EF10 RXDP firmware is operating in cut-through mode, packet length is not known at the time the RX prefix is generated, so it is left as zero and RX event merging is inhibited to ensure that the length is available in the RX event. However, it has been found that in certain circumstances the RX events for these packets still get merged, meaning the driver cannot read the length from the RX event, and tries to use the length from the prefix. The resulting zero-length SKBs cause crashes in GRO since commit 1d11fa696733 ("net-gro: remove GRO_DROP"), so add a check to the driver to detect these zero-length RX events and discard the packet. Signed-off-by: Edward Cree Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/rx.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c index 2375cef577e4..f77a2d3ef37e 100644 --- a/drivers/net/ethernet/sfc/rx.c +++ b/drivers/net/ethernet/sfc/rx.c @@ -359,26 +359,36 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel, /* Handle a received packet. Second half: Touches packet payload. */ void __efx_rx_packet(struct efx_channel *channel) { + struct efx_rx_queue *rx_queue = efx_channel_get_rx_queue(channel); struct efx_nic *efx = channel->efx; struct efx_rx_buffer *rx_buf = - efx_rx_buffer(&channel->rx_queue, channel->rx_pkt_index); + efx_rx_buffer(rx_queue, channel->rx_pkt_index); u8 *eh = efx_rx_buf_va(rx_buf); /* Read length from the prefix if necessary. This already * excludes the length of the prefix itself. */ - if (rx_buf->flags & EFX_RX_PKT_PREFIX_LEN) + if (rx_buf->flags & EFX_RX_PKT_PREFIX_LEN) { rx_buf->len = le16_to_cpup((__le16 *) (eh + efx->rx_packet_len_offset)); + /* A known issue may prevent this being filled in; + * if that happens, just drop the packet. + * Must do that in the driver since passing a zero-length + * packet up to the stack may cause a crash. + */ + if (unlikely(!rx_buf->len)) { + efx_free_rx_buffers(rx_queue, rx_buf, + channel->rx_pkt_n_frags); + channel->n_rx_frm_trunc++; + goto out; + } + } /* If we're in loopback test, then pass the packet directly to the * loopback layer, and free the rx_buf here */ if (unlikely(efx->loopback_selftest)) { - struct efx_rx_queue *rx_queue; - efx_loopback_rx_packet(efx, eh, rx_buf->len); - rx_queue = efx_channel_get_rx_queue(channel); efx_free_rx_buffers(rx_queue, rx_buf, channel->rx_pkt_n_frags); goto out; From c1970e26bdc1209974bb5cf31cc23f2b7ad6ce50 Mon Sep 17 00:00:00 2001 From: Xu Kuohai Date: Fri, 1 Sep 2023 11:10:37 +0800 Subject: [PATCH 41/95] selftests/bpf: Fix a CI failure caused by vsock write While commit 90f0074cd9f9 ("selftests/bpf: fix a CI failure caused by vsock sockmap test") fixes a receive failure of vsock sockmap test, there is still a write failure: Error: #211/79 sockmap_listen/sockmap VSOCK test_vsock_redir Error: #211/79 sockmap_listen/sockmap VSOCK test_vsock_redir ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected vsock_unix_redir_connectible:FAIL:1501 ./test_progs:vsock_unix_redir_connectible:1501: ingress: write: Transport endpoint is not connected vsock_unix_redir_connectible:FAIL:1501 ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected vsock_unix_redir_connectible:FAIL:1501 The reason is that the vsock connection in the test is set to ESTABLISHED state by function virtio_transport_recv_pkt, which is executed in a workqueue thread, so when the user space test thread runs before the workqueue thread, this problem occurs. To fix it, before writing the connection, wait for it to be connected. Fixes: d61bd8c1fd02 ("selftests/bpf: add a test case for vsock sockmap") Signed-off-by: Xu Kuohai Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230901031037.3314007-1-xukuohai@huaweicloud.com --- .../bpf/prog_tests/sockmap_helpers.h | 26 +++++++++++++++++++ .../selftests/bpf/prog_tests/sockmap_listen.c | 7 +++++ 2 files changed, 33 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h index d12665490a90..36d829a65aa4 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h @@ -179,6 +179,32 @@ __ret; \ }) +static inline int poll_connect(int fd, unsigned int timeout_sec) +{ + struct timeval timeout = { .tv_sec = timeout_sec }; + fd_set wfds; + int r, eval; + socklen_t esize = sizeof(eval); + + FD_ZERO(&wfds); + FD_SET(fd, &wfds); + + r = select(fd + 1, NULL, &wfds, NULL, &timeout); + if (r == 0) + errno = ETIME; + if (r != 1) + return -1; + + if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &eval, &esize) < 0) + return -1; + if (eval != 0) { + errno = eval; + return -1; + } + + return 0; +} + static inline int poll_read(int fd, unsigned int timeout_sec) { struct timeval timeout = { .tv_sec = timeout_sec }; diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c index 5674a9d0cacf..8df8cbb447f1 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c @@ -1452,11 +1452,18 @@ static int vsock_socketpair_connectible(int sotype, int *v0, int *v1) if (p < 0) goto close_cli; + if (poll_connect(c, IO_TIMEOUT_SEC) < 0) { + FAIL_ERRNO("poll_connect"); + goto close_acc; + } + *v0 = p; *v1 = c; return 0; +close_acc: + close(p); close_cli: close(c); close_srv: From 3888fa134eddac467b5a094949a8f0731ef6ffd5 Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Fri, 1 Sep 2023 15:59:35 +0300 Subject: [PATCH 42/95] docs/bpf: Fix "file doesn't exist" warnings in {llvm_reloc,btf}.rst scripts/documentation-file-ref-check reports warnings for (valid) cross-links of form: :ref:`Documentation/bpf/btf ` Adding extension to the file name helps to avoid the warning, e.g: :ref:`Documentation/bpf/btf.rst ` Fixes: be4033d36070 ("docs/bpf: Add description for CO-RE relocations") Reported-by: kernel test robot Signed-off-by: Eduard Zingerman Signed-off-by: Daniel Borkmann Acked-by: Jiri Olsa Closes: https://lore.kernel.org/oe-kbuild-all/202309010804.G3MpXo59-lkp@intel.com Link: https://lore.kernel.org/bpf/20230901125935.487972-1-eddyz87@gmail.com --- Documentation/bpf/btf.rst | 2 +- Documentation/bpf/llvm_reloc.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index ffc11afee569..e43c2fdafcd7 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -803,7 +803,7 @@ structure when .BTF.ext is generated. All ``bpf_core_relo`` structures within a single ``btf_ext_info_sec`` describe relocations applied to section named by ``btf_ext_info_sec->sec_name_off``. -See :ref:`Documentation/bpf/llvm_reloc ` +See :ref:`Documentation/bpf/llvm_reloc.rst ` for more information on CO-RE relocations. 4.2 .BTF_ids section diff --git a/Documentation/bpf/llvm_reloc.rst b/Documentation/bpf/llvm_reloc.rst index 73bf805000f2..44188e219d32 100644 --- a/Documentation/bpf/llvm_reloc.rst +++ b/Documentation/bpf/llvm_reloc.rst @@ -250,7 +250,7 @@ CO-RE Relocations From object file point of view CO-RE mechanism is implemented as a set of CO-RE specific relocation records. These relocation records are not related to ELF relocations and are encoded in .BTF.ext section. -See :ref:`Documentation/bpf/btf ` for more +See :ref:`Documentation/bpf/btf.rst ` for more information on .BTF.ext structure. CO-RE relocations are applied to BPF instructions to update immediate From fa09bc40b21a33937872c4c4cf0f266ec9fa4869 Mon Sep 17 00:00:00 2001 From: Corinna Vinschen Date: Thu, 31 Aug 2023 14:19:13 +0200 Subject: [PATCH 43/95] igb: disable virtualization features on 82580 Disable virtualization features on 82580 just as on i210/i211. This avoids that virt functions are acidentally called on 82850. Fixes: 55cac248caa4 ("igb: Add full support for 82580 devices") Signed-off-by: Corinna Vinschen Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- drivers/net/ethernet/intel/igb/igb_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 1ab787ed254d..13ba9c74bd84 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -3933,8 +3933,9 @@ static void igb_probe_vfs(struct igb_adapter *adapter) struct pci_dev *pdev = adapter->pdev; struct e1000_hw *hw = &adapter->hw; - /* Virtualization features not supported on i210 family. */ - if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211)) + /* Virtualization features not supported on i210 and 82580 family. */ + if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211) || + (hw->mac.type == e1000_82580)) return; /* Of the below we really only want the effect of getting From 915d975b2ffa58a14bfcf16fafe00c41315949ff Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 18:37:50 +0000 Subject: [PATCH 44/95] net: deal with integer overflows in kmalloc_reserve() Blamed commit changed: ptr = kmalloc(size); if (ptr) size = ksize(ptr); to: size = kmalloc_size_roundup(size); ptr = kmalloc(size); This allowed various crash as reported by syzbot [1] and Kyle Zeng. Problem is that if @size is bigger than 0x80000001, kmalloc_size_roundup(size) returns 2^32. kmalloc_reserve() uses a 32bit variable (obj_size), so 2^32 is truncated to 0. kmalloc(0) returns ZERO_SIZE_PTR which is not handled by skb allocations. Following trace can be triggered if a netdev->mtu is set close to 0x7fffffff We might in the future limit netdev->mtu to more sensible limit (like KMALLOC_MAX_SIZE). This patch is based on a syzbot report, and also a report and tentative fix from Kyle Zeng. [1] BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline] BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527 Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554 CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023 Call trace: dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106 print_report+0xe4/0x4b4 mm/kasan/report.c:398 kasan_report+0x150/0x1ac mm/kasan/report.c:495 kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189 memset+0x40/0x70 mm/kasan/shadow.c:44 __build_skb_around net/core/skbuff.c:294 [inline] __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527 alloc_skb include/linux/skbuff.h:1316 [inline] igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359 add_grec+0x81c/0x1124 net/ipv4/igmp.c:534 igmpv3_send_cr net/ipv4/igmp.c:667 [inline] igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810 call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474 expire_timers kernel/time/timer.c:1519 [inline] __run_timers+0x54c/0x710 kernel/time/timer.c:1790 run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803 _stext+0x380/0xfbc ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79 call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891 do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84 invoke_softirq kernel/softirq.c:437 [inline] __irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683 irq_exit_rcu+0x14/0x78 kernel/softirq.c:695 el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717 __el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724 el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729 el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584 Fixes: 12d6c1d3a2ad ("skbuff: Proactively round up to kmalloc bucket size") Reported-by: syzbot Reported-by: Kyle Zeng Signed-off-by: Eric Dumazet Cc: Kees Cook Cc: Vlastimil Babka Signed-off-by: David S. Miller --- net/core/skbuff.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 17caf4ea67da..4eaf7ed0d1f4 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -550,7 +550,7 @@ static void *kmalloc_reserve(unsigned int *size, gfp_t flags, int node, bool *pfmemalloc) { bool ret_pfmemalloc = false; - unsigned int obj_size; + size_t obj_size; void *obj; obj_size = SKB_HEAD_ALIGN(*size); @@ -567,7 +567,13 @@ static void *kmalloc_reserve(unsigned int *size, gfp_t flags, int node, obj = kmem_cache_alloc_node(skb_small_head_cache, flags, node); goto out; } - *size = obj_size = kmalloc_size_roundup(obj_size); + + obj_size = kmalloc_size_roundup(obj_size); + /* The following cast might truncate high-order bits of obj_size, this + * is harmless because kmalloc(obj_size >= 2^32) will fail anyway. + */ + *size = (unsigned int)obj_size; + /* * Try a regular allocation, when that fails and we're not entitled * to the reserves, fail. From 817c7cd2043a83a3d8147f40eea1505ac7300b62 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 31 Aug 2023 21:38:12 +0000 Subject: [PATCH 45/95] gve: fix frag_list chaining gve_rx_append_frags() is able to build skbs chained with frag_list, like GRO engine. Problem is that shinfo->frag_list should only be used for the head of the chain. All other links should use skb->next pointer. Otherwise, built skbs are not valid and can cause crashes. Equivalent code in GRO (skb_gro_receive()) is: if (NAPI_GRO_CB(p)->last == p) skb_shinfo(p)->frag_list = skb; else NAPI_GRO_CB(p)->last->next = skb; NAPI_GRO_CB(p)->last = skb; Fixes: 9b8dd5e5ea48 ("gve: DQO: Add RX path") Signed-off-by: Eric Dumazet Cc: Bailey Forrest Cc: Willem de Bruijn Cc: Catherine Sullivan Reviewed-by: David Ahern Signed-off-by: David S. Miller --- drivers/net/ethernet/google/gve/gve_rx_dqo.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c index ea0e38b4d9e9..f281e42a7ef9 100644 --- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c +++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c @@ -570,7 +570,10 @@ static int gve_rx_append_frags(struct napi_struct *napi, if (!skb) return -1; - skb_shinfo(rx->ctx.skb_tail)->frag_list = skb; + if (rx->ctx.skb_tail == rx->ctx.skb_head) + skb_shinfo(rx->ctx.skb_head)->frag_list = skb; + else + rx->ctx.skb_tail->next = skb; rx->ctx.skb_tail = skb; num_frags = 0; } From 151e887d8ff97e2e42110ffa1fb1e6a2128fb364 Mon Sep 17 00:00:00 2001 From: Liang Chen Date: Fri, 1 Sep 2023 12:09:21 +0800 Subject: [PATCH 46/95] veth: Fixing transmit return status for dropped packets The veth_xmit function returns NETDEV_TX_OK even when packets are dropped. This behavior leads to incorrect calculations of statistics counts, as well as things like txq->trans_start updates. Fixes: e314dbdc1c0d ("[NET]: Virtual ethernet device driver.") Signed-off-by: Liang Chen Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- drivers/net/veth.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index d43e62ebc2fc..9c6f4f83f22b 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -344,6 +344,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) { struct veth_priv *rcv_priv, *priv = netdev_priv(dev); struct veth_rq *rq = NULL; + int ret = NETDEV_TX_OK; struct net_device *rcv; int length = skb->len; bool use_napi = false; @@ -378,11 +379,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) } else { drop: atomic64_inc(&priv->dropped); + ret = NET_XMIT_DROP; } rcu_read_unlock(); - return NETDEV_TX_OK; + return ret; } static u64 veth_stats_tx(struct net_device *dev, u64 *packets, u64 *bytes) From f31867d0d9d82af757c1e0178b659438f4c1ea3c Mon Sep 17 00:00:00 2001 From: Alex Henrie Date: Thu, 31 Aug 2023 22:41:27 -0600 Subject: [PATCH 47/95] net: ipv6/addrconf: avoid integer underflow in ipv6_create_tempaddr The existing code incorrectly casted a negative value (the result of a subtraction) to an unsigned value without checking. For example, if /proc/sys/net/ipv6/conf/*/temp_prefered_lft was set to 1, the preferred lifetime would jump to 4 billion seconds. On my machine and network the shortest lifetime that avoided underflow was 3 seconds. Fixes: 76506a986dc3 ("IPv6: fix DESYNC_FACTOR") Signed-off-by: Alex Henrie Reviewed-by: David Ahern Signed-off-by: David S. Miller --- net/ipv6/addrconf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 47d1dd8501b7..85cdbc252654 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1378,7 +1378,7 @@ retry: * idev->desync_factor if it's larger */ cnf_temp_preferred_lft = READ_ONCE(idev->cnf.temp_prefered_lft); - max_desync_factor = min_t(__u32, + max_desync_factor = min_t(long, idev->cnf.max_desync_factor, cnf_temp_preferred_lft - regen_advance); From 719c5e37e99d2fd588d1c994284d17650a66354c Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Fri, 1 Sep 2023 06:53:23 +0200 Subject: [PATCH 48/95] net: phy: micrel: Correct bit assignments for phy_device flags Previously, the defines for phy_device flags in the Micrel driver were ambiguous in their representation. They were intended to be bit masks but were mistakenly defined as bit positions. This led to the following issues: - MICREL_KSZ8_P1_ERRATA, designated for KSZ88xx switches, overlapped with MICREL_PHY_FXEN and MICREL_PHY_50MHZ_CLK. - Due to this overlap, the code path for MICREL_PHY_FXEN, tailored for the KSZ8041 PHY, was not executed for KSZ88xx PHYs. - Similarly, the code associated with MICREL_PHY_50MHZ_CLK wasn't triggered for KSZ88xx. To rectify this, all three flags have now been explicitly converted to use the `BIT()` macro, ensuring they are defined as bit masks and preventing potential overlaps in the future. Fixes: 49011e0c1555 ("net: phy: micrel: ksz886x/ksz8081: add cabletest support") Signed-off-by: Oleksij Rempel Reviewed-by: Russell King (Oracle) Signed-off-by: David S. Miller --- include/linux/micrel_phy.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/micrel_phy.h b/include/linux/micrel_phy.h index 8bef1ab62bba..322d87255984 100644 --- a/include/linux/micrel_phy.h +++ b/include/linux/micrel_phy.h @@ -41,9 +41,9 @@ #define PHY_ID_KSZ9477 0x00221631 /* struct phy_device dev_flags definitions */ -#define MICREL_PHY_50MHZ_CLK 0x00000001 -#define MICREL_PHY_FXEN 0x00000002 -#define MICREL_KSZ8_P1_ERRATA 0x00000003 +#define MICREL_PHY_50MHZ_CLK BIT(0) +#define MICREL_PHY_FXEN BIT(1) +#define MICREL_KSZ8_P1_ERRATA BIT(2) #define MICREL_KSZ9021_EXTREG_CTRL 0xB #define MICREL_KSZ9021_EXTREG_DATA_WRITE 0xC From a454d84ee20baf7bd7be90721b9821f73c7d23d9 Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Fri, 1 Sep 2023 13:21:37 -0700 Subject: [PATCH 49/95] bpf, sockmap: Fix skb refcnt race after locking changes There is a race where skb's from the sk_psock_backlog can be referenced after userspace side has already skb_consumed() the sk_buff and its refcnt dropped to zer0 causing use after free. The flow is the following: while ((skb = skb_peek(&psock->ingress_skb)) sk_psock_handle_Skb(psock, skb, ..., ingress) if (!ingress) ... sk_psock_skb_ingress sk_psock_skb_ingress_enqueue(skb) msg->skb = skb sk_psock_queue_msg(psock, msg) skb_dequeue(&psock->ingress_skb) The sk_psock_queue_msg() puts the msg on the ingress_msg queue. This is what the application reads when recvmsg() is called. An application can read this anytime after the msg is placed on the queue. The recvmsg hook will also read msg->skb and then after user space reads the msg will call consume_skb(skb) on it effectively free'ing it. But, the race is in above where backlog queue still has a reference to the skb and calls skb_dequeue(). If the skb_dequeue happens after the user reads and free's the skb we have a use after free. The !ingress case does not suffer from this problem because it uses sendmsg_*(sk, msg) which does not pass the sk_buff further down the stack. The following splat was observed with 'test_progs -t sockmap_listen': [ 1022.710250][ T2556] general protection fault, ... [...] [ 1022.712830][ T2556] Workqueue: events sk_psock_backlog [ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80 [ 1022.713653][ T2556] Code: ... [...] [ 1022.720699][ T2556] Call Trace: [ 1022.720984][ T2556] [ 1022.721254][ T2556] ? die_addr+0x32/0x80^M [ 1022.721589][ T2556] ? exc_general_protection+0x25a/0x4b0 [ 1022.722026][ T2556] ? asm_exc_general_protection+0x22/0x30 [ 1022.722489][ T2556] ? skb_dequeue+0x4c/0x80 [ 1022.722854][ T2556] sk_psock_backlog+0x27a/0x300 [ 1022.723243][ T2556] process_one_work+0x2a7/0x5b0 [ 1022.723633][ T2556] worker_thread+0x4f/0x3a0 [ 1022.723998][ T2556] ? __pfx_worker_thread+0x10/0x10 [ 1022.724386][ T2556] kthread+0xfd/0x130 [ 1022.724709][ T2556] ? __pfx_kthread+0x10/0x10 [ 1022.725066][ T2556] ret_from_fork+0x2d/0x50 [ 1022.725409][ T2556] ? __pfx_kthread+0x10/0x10 [ 1022.725799][ T2556] ret_from_fork_asm+0x1b/0x30 [ 1022.726201][ T2556] To fix we add an skb_get() before passing the skb to be enqueued in the engress queue. This bumps the skb->users refcnt so that consume_skb() and kfree_skb will not immediately free the sk_buff. With this we can be sure the skb is still around when we do the dequeue. Then we just need to decrement the refcnt or free the skb in the backlog case which we do by calling kfree_skb() on the ingress case as well as the sendmsg case. Before locking change from fixes tag we had the sock locked so we couldn't race with user and there was no issue here. Fixes: 799aa7f98d53e ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Reported-by: Jiri Olsa Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann Tested-by: Xu Kuohai Tested-by: Jiri Olsa Link: https://lore.kernel.org/bpf/20230901202137.214666-1-john.fastabend@gmail.com --- net/core/skmsg.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index a0659fc29bcc..6c31eefbd777 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -612,12 +612,18 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, u32 off, u32 len, bool ingress) { + int err = 0; + if (!ingress) { if (!sock_writeable(psock->sk)) return -EAGAIN; return skb_send_sock(psock->sk, skb, off, len); } - return sk_psock_skb_ingress(psock, skb, off, len); + skb_get(skb); + err = sk_psock_skb_ingress(psock, skb, off, len); + if (err < 0) + kfree_skb(skb); + return err; } static void sk_psock_skb_state(struct sk_psock *psock, @@ -685,9 +691,7 @@ static void sk_psock_backlog(struct work_struct *work) } while (len); skb = skb_dequeue(&psock->ingress_skb); - if (!ingress) { - kfree_skb(skb); - } + kfree_skb(skb); } end: mutex_unlock(&psock->work_mutex); From ee8ab74aa0c248138c14f74cc6a636e0191c410b Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 1 Sep 2023 07:24:05 -0700 Subject: [PATCH 50/95] docs: netdev: document patchwork patch states The patchwork states are largely self-explanatory but small ambiguities may still come up. Document how we interpret the states in networking. Signed-off-by: Jakub Kicinski Signed-off-by: David S. Miller --- Documentation/process/maintainer-netdev.rst | 32 ++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst index c1c732e9748b..db1b81cfba9b 100644 --- a/Documentation/process/maintainer-netdev.rst +++ b/Documentation/process/maintainer-netdev.rst @@ -120,7 +120,37 @@ queue for netdev: https://patchwork.kernel.org/project/netdevbpf/list/ The "State" field will tell you exactly where things are at with your -patch. Patches are indexed by the ``Message-ID`` header of the emails +patch: + +================== ============================================================= +Patch state Description +================== ============================================================= +New, Under review pending review, patch is in the maintainer’s queue for + review; the two states are used interchangeably (depending on + the exact co-maintainer handling patchwork at the time) +Accepted patch was applied to the appropriate networking tree, this is + usually set automatically by the pw-bot +Needs ACK waiting for an ack from an area expert or testing +Changes requested patch has not passed the review, new revision is expected + with appropriate code and commit message changes +Rejected patch has been rejected and new revision is not expected +Not applicable patch is expected to be applied outside of the networking + subsystem +Awaiting upstream patch should be reviewed and handled by appropriate + sub-maintainer, who will send it on to the networking trees; + patches set to ``Awaiting upstream`` in netdev's patchwork + will usually remain in this state, whether the sub-maintainer + requested changes, accepted or rejected the patch +Deferred patch needs to be reposted later, usually due to dependency + or because it was posted for a closed tree +Superseded new version of the patch was posted, usually set by the + pw-bot +RFC not to be applied, usually not in maintainer’s review queue, + pw-bot can automatically set patches to this state based + on subject tags +================== ============================================================= + +Patches are indexed by the ``Message-ID`` header of the emails which carried them so if you have trouble finding your patch append the value of ``Message-ID`` to the URL above. From 5245008738029135af52a2048d9fe9c4dd9be698 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 1 Sep 2023 14:17:18 -0700 Subject: [PATCH 51/95] docs: netdev: update the netdev infra URLs Some corporate proxies block our current NIPA URLs because they use a free / shady DNS domain. As suggested by Jesse we got a new DNS entry from Konstantin - netdev.bots.linux.dev, use it. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- Documentation/process/maintainer-netdev.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst index db1b81cfba9b..09dcf6377c27 100644 --- a/Documentation/process/maintainer-netdev.rst +++ b/Documentation/process/maintainer-netdev.rst @@ -98,7 +98,7 @@ If you aren't subscribed to netdev and/or are simply unsure if repository link above for any new networking-related commits. You may also check the following website for the current status: - https://patchwork.hopto.org/net-next.html + https://netdev.bots.linux.dev/net-next.html The ``net`` tree continues to collect fixes for the vX.Y content, and is fed back to Linus at regular (~weekly) intervals. Meaning that the @@ -185,7 +185,7 @@ must match the MAINTAINERS entry) and a handful of senior reviewers. Bot records its activity here: - https://patchwork.hopto.org/pw-bot.html + https://netdev.bots.linux.dev/pw-bot.html Review timelines ~~~~~~~~~~~~~~~~ From 718e6b51298e0f254baca0d40ab52a00e004e014 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Fri, 1 Sep 2023 16:46:04 -0700 Subject: [PATCH 52/95] af_unix: Fix msg_controllen test in scm_pidfd_recv() for MSG_CMSG_COMPAT. Heiko Carstens reported that SCM_PIDFD does not work with MSG_CMSG_COMPAT because scm_pidfd_recv() always checks msg_controllen against sizeof(struct cmsghdr). We need to use sizeof(struct compat_cmsghdr) for the compat case. Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD") Reported-by: Heiko Carstens Closes: https://lore.kernel.org/netdev/20230901200517.8742-A-hca@linux.ibm.com/ Signed-off-by: Kuniyuki Iwashima Tested-by: Heiko Carstens Reviewed-by: Alexander Mikhalitsyn Reviewed-by: Michal Swiatkowski Acked-by: Christian Brauner Signed-off-by: David S. Miller --- include/net/scm.h | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/include/net/scm.h b/include/net/scm.h index c5bcdf65f55c..e8c76b4be2fe 100644 --- a/include/net/scm.h +++ b/include/net/scm.h @@ -9,6 +9,7 @@ #include #include #include +#include /* Well, we should have at least one descriptor open * to accept passed FDs 8) @@ -123,14 +124,17 @@ static inline bool scm_has_secdata(struct socket *sock) static __inline__ void scm_pidfd_recv(struct msghdr *msg, struct scm_cookie *scm) { struct file *pidfd_file = NULL; - int pidfd; + int len, pidfd; - /* - * put_cmsg() doesn't return an error if CMSG is truncated, + /* put_cmsg() doesn't return an error if CMSG is truncated, * that's why we need to opencode these checks here. */ - if ((msg->msg_controllen <= sizeof(struct cmsghdr)) || - (msg->msg_controllen - sizeof(struct cmsghdr)) < sizeof(int)) { + if (msg->msg_flags & MSG_CMSG_COMPAT) + len = sizeof(struct compat_cmsghdr) + sizeof(int); + else + len = sizeof(struct cmsghdr) + sizeof(int); + + if (msg->msg_controllen < len) { msg->msg_flags |= MSG_CTRUNC; return; } From 0bc36c0650b21df36fbec8136add83936eaf0607 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Fri, 1 Sep 2023 17:27:05 -0700 Subject: [PATCH 53/95] af_unix: Fix data-races around user->unix_inflight. user->unix_inflight is changed under spin_lock(unix_gc_lock), but too_many_unix_fds() reads it locklessly. Let's annotate the write/read accesses to user->unix_inflight. BUG: KCSAN: data-race in unix_attach_fds / unix_inflight write to 0xffffffff8546f2d0 of 8 bytes by task 44798 on cpu 1: unix_inflight+0x157/0x180 net/unix/scm.c:66 unix_attach_fds+0x147/0x1e0 net/unix/scm.c:123 unix_scm_to_skb net/unix/af_unix.c:1827 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950 unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline] unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x148/0x160 net/socket.c:748 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494 ___sys_sendmsg+0xc6/0x140 net/socket.c:2548 __sys_sendmsg+0x94/0x140 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 read to 0xffffffff8546f2d0 of 8 bytes by task 44814 on cpu 0: too_many_unix_fds net/unix/scm.c:101 [inline] unix_attach_fds+0x54/0x1e0 net/unix/scm.c:110 unix_scm_to_skb net/unix/af_unix.c:1827 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950 unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline] unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x148/0x160 net/socket.c:748 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494 ___sys_sendmsg+0xc6/0x140 net/socket.c:2548 __sys_sendmsg+0x94/0x140 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 value changed: 0x000000000000000c -> 0x000000000000000d Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 44814 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 712f4aad406b ("unix: properly account for FDs passed over unix sockets") Reported-by: syzkaller Signed-off-by: Kuniyuki Iwashima Acked-by: Willy Tarreau Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/unix/scm.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/unix/scm.c b/net/unix/scm.c index e9dde7176c8a..6ff628f2349f 100644 --- a/net/unix/scm.c +++ b/net/unix/scm.c @@ -64,7 +64,7 @@ void unix_inflight(struct user_struct *user, struct file *fp) /* Paired with READ_ONCE() in wait_for_unix_gc() */ WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + 1); } - user->unix_inflight++; + WRITE_ONCE(user->unix_inflight, user->unix_inflight + 1); spin_unlock(&unix_gc_lock); } @@ -85,7 +85,7 @@ void unix_notinflight(struct user_struct *user, struct file *fp) /* Paired with READ_ONCE() in wait_for_unix_gc() */ WRITE_ONCE(unix_tot_inflight, unix_tot_inflight - 1); } - user->unix_inflight--; + WRITE_ONCE(user->unix_inflight, user->unix_inflight - 1); spin_unlock(&unix_gc_lock); } @@ -99,7 +99,7 @@ static inline bool too_many_unix_fds(struct task_struct *p) { struct user_struct *user = current_user(); - if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE))) + if (unlikely(READ_ONCE(user->unix_inflight) > task_rlimit(p, RLIMIT_NOFILE))) return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN); return false; } From ade32bd8a738d7497ffe9743c46728db26740f78 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Fri, 1 Sep 2023 17:27:06 -0700 Subject: [PATCH 54/95] af_unix: Fix data-race around unix_tot_inflight. unix_tot_inflight is changed under spin_lock(unix_gc_lock), but unix_release_sock() reads it locklessly. Let's use READ_ONCE() for unix_tot_inflight. Note that the writer side was marked by commit 9d6d7f1cb67c ("af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress") BUG: KCSAN: data-race in unix_inflight / unix_release_sock write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1: unix_inflight+0x130/0x180 net/unix/scm.c:64 unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123 unix_scm_to_skb net/unix/af_unix.c:1832 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0x148/0x160 net/socket.c:747 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493 ___sys_sendmsg+0xc6/0x140 net/socket.c:2547 __sys_sendmsg+0x94/0x140 net/socket.c:2576 __do_sys_sendmsg net/socket.c:2585 [inline] __se_sys_sendmsg net/socket.c:2583 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0: unix_release_sock+0x608/0x910 net/unix/af_unix.c:671 unix_release+0x59/0x80 net/unix/af_unix.c:1058 __sock_release+0x7d/0x170 net/socket.c:653 sock_close+0x19/0x30 net/socket.c:1385 __fput+0x179/0x5e0 fs/file_table.c:321 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x00000000 -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 9305cfa4443d ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic") Reported-by: syzkaller Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/unix/af_unix.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 86930a8ed012..3e8a04a13668 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -680,7 +680,7 @@ static void unix_release_sock(struct sock *sk, int embrion) * What the above comment does talk about? --ANK(980817) */ - if (unix_tot_inflight) + if (READ_ONCE(unix_tot_inflight)) unix_gc(); /* Garbage collect fds */ } From afe8764f76346ba838d4f162883e23d2fcfaa90e Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Fri, 1 Sep 2023 17:27:07 -0700 Subject: [PATCH 55/95] af_unix: Fix data-races around sk->sk_shutdown. sk->sk_shutdown is changed under unix_state_lock(sk), but unix_dgram_sendmsg() calls two functions to read sk_shutdown locklessly. sock_alloc_send_pskb `- sock_wait_for_wmem Let's use READ_ONCE() there. Note that the writer side was marked by commit e1d09c2c2f57 ("af_unix: Fix data races around sk->sk_shutdown."). BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock write (marked) to 0xffff8880069af12c of 1 bytes by task 1 on cpu 1: unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631 unix_release+0x59/0x80 net/unix/af_unix.c:1053 __sock_release+0x7d/0x170 net/socket.c:654 sock_close+0x19/0x30 net/socket.c:1386 __fput+0x2a3/0x680 fs/file_table.c:384 ____fput+0x15/0x20 fs/file_table.c:412 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 read to 0xffff8880069af12c of 1 bytes by task 28650 on cpu 0: sock_alloc_send_pskb+0xd2/0x620 net/core/sock.c:2767 unix_dgram_sendmsg+0x2f8/0x14f0 net/unix/af_unix.c:1944 unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline] unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x148/0x160 net/socket.c:748 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494 ___sys_sendmsg+0xc6/0x140 net/socket.c:2548 __sys_sendmsg+0x94/0x140 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 value changed: 0x00 -> 0x03 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 28650 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzkaller Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/sock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index d3c7b53368d2..e3da7eae9338 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2747,7 +2747,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo) prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); if (refcount_read(&sk->sk_wmem_alloc) < READ_ONCE(sk->sk_sndbuf)) break; - if (sk->sk_shutdown & SEND_SHUTDOWN) + if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) break; if (sk->sk_err) break; @@ -2777,7 +2777,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len, goto failure; err = -EPIPE; - if (sk->sk_shutdown & SEND_SHUTDOWN) + if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) goto failure; if (sk_wmem_alloc_get(sk) < READ_ONCE(sk->sk_sndbuf)) From b192812905e4b134f7b7994b079eb647e9d2d37e Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Fri, 1 Sep 2023 17:27:08 -0700 Subject: [PATCH 56/95] af_unix: Fix data race around sk->sk_err. As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be read locklessly by unix_dgram_sendmsg(). Let's use READ_ONCE() for sk_err as well. Note that the writer side is marked by commit cc04410af7de ("af_unix: annotate lockless accesses to sk->sk_err"). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/sock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index e3da7eae9338..16584e2dd648 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2749,7 +2749,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo) break; if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) break; - if (sk->sk_err) + if (READ_ONCE(sk->sk_err)) break; timeo = schedule_timeout(timeo); } From 8fc134fee27f2263988ae38920bc03da416b03d8 Mon Sep 17 00:00:00 2001 From: valis Date: Fri, 1 Sep 2023 12:22:37 -0400 Subject: [PATCH 57/95] net: sched: sch_qfq: Fix UAF in qfq_dequeue() When the plug qdisc is used as a class of the qfq qdisc it could trigger a UAF. This issue can be reproduced with following commands: tc qdisc add dev lo root handle 1: qfq tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512 tc qdisc add dev lo parent 1:1 handle 2: plug tc filter add dev lo parent 1: basic classid 1:1 ping -c1 127.0.0.1 and boom: [ 285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0 [ 285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144 [ 285.355903] [ 285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4 [ 285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 285.358376] Call Trace: [ 285.358773] [ 285.359109] dump_stack_lvl+0x44/0x60 [ 285.359708] print_address_description.constprop.0+0x2c/0x3c0 [ 285.360611] kasan_report+0x10c/0x120 [ 285.361195] ? qfq_dequeue+0xa7/0x7f0 [ 285.361780] qfq_dequeue+0xa7/0x7f0 [ 285.362342] __qdisc_run+0xf1/0x970 [ 285.362903] net_tx_action+0x28e/0x460 [ 285.363502] __do_softirq+0x11b/0x3de [ 285.364097] do_softirq.part.0+0x72/0x90 [ 285.364721] [ 285.365072] [ 285.365422] __local_bh_enable_ip+0x77/0x90 [ 285.366079] __dev_queue_xmit+0x95f/0x1550 [ 285.366732] ? __pfx_csum_and_copy_from_iter+0x10/0x10 [ 285.367526] ? __pfx___dev_queue_xmit+0x10/0x10 [ 285.368259] ? __build_skb_around+0x129/0x190 [ 285.368960] ? ip_generic_getfrag+0x12c/0x170 [ 285.369653] ? __pfx_ip_generic_getfrag+0x10/0x10 [ 285.370390] ? csum_partial+0x8/0x20 [ 285.370961] ? raw_getfrag+0xe5/0x140 [ 285.371559] ip_finish_output2+0x539/0xa40 [ 285.372222] ? __pfx_ip_finish_output2+0x10/0x10 [ 285.372954] ip_output+0x113/0x1e0 [ 285.373512] ? __pfx_ip_output+0x10/0x10 [ 285.374130] ? icmp_out_count+0x49/0x60 [ 285.374739] ? __pfx_ip_finish_output+0x10/0x10 [ 285.375457] ip_push_pending_frames+0xf3/0x100 [ 285.376173] raw_sendmsg+0xef5/0x12d0 [ 285.376760] ? do_syscall_64+0x40/0x90 [ 285.377359] ? __static_call_text_end+0x136578/0x136578 [ 285.378173] ? do_syscall_64+0x40/0x90 [ 285.378772] ? kasan_enable_current+0x11/0x20 [ 285.379469] ? __pfx_raw_sendmsg+0x10/0x10 [ 285.380137] ? __sock_create+0x13e/0x270 [ 285.380673] ? __sys_socket+0xf3/0x180 [ 285.381174] ? __x64_sys_socket+0x3d/0x50 [ 285.381725] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 285.382425] ? __rcu_read_unlock+0x48/0x70 [ 285.382975] ? ip4_datagram_release_cb+0xd8/0x380 [ 285.383608] ? __pfx_ip4_datagram_release_cb+0x10/0x10 [ 285.384295] ? preempt_count_sub+0x14/0xc0 [ 285.384844] ? __list_del_entry_valid+0x76/0x140 [ 285.385467] ? _raw_spin_lock_bh+0x87/0xe0 [ 285.386014] ? __pfx__raw_spin_lock_bh+0x10/0x10 [ 285.386645] ? release_sock+0xa0/0xd0 [ 285.387148] ? preempt_count_sub+0x14/0xc0 [ 285.387712] ? freeze_secondary_cpus+0x348/0x3c0 [ 285.388341] ? aa_sk_perm+0x177/0x390 [ 285.388856] ? __pfx_aa_sk_perm+0x10/0x10 [ 285.389441] ? check_stack_object+0x22/0x70 [ 285.390032] ? inet_send_prepare+0x2f/0x120 [ 285.390603] ? __pfx_inet_sendmsg+0x10/0x10 [ 285.391172] sock_sendmsg+0xcc/0xe0 [ 285.391667] __sys_sendto+0x190/0x230 [ 285.392168] ? __pfx___sys_sendto+0x10/0x10 [ 285.392727] ? kvm_clock_get_cycles+0x14/0x30 [ 285.393328] ? set_normalized_timespec64+0x57/0x70 [ 285.393980] ? _raw_spin_unlock_irq+0x1b/0x40 [ 285.394578] ? __x64_sys_clock_gettime+0x11c/0x160 [ 285.395225] ? __pfx___x64_sys_clock_gettime+0x10/0x10 [ 285.395908] ? _copy_to_user+0x3e/0x60 [ 285.396432] ? exit_to_user_mode_prepare+0x1a/0x120 [ 285.397086] ? syscall_exit_to_user_mode+0x22/0x50 [ 285.397734] ? do_syscall_64+0x71/0x90 [ 285.398258] __x64_sys_sendto+0x74/0x90 [ 285.398786] do_syscall_64+0x64/0x90 [ 285.399273] ? exit_to_user_mode_prepare+0x1a/0x120 [ 285.399949] ? syscall_exit_to_user_mode+0x22/0x50 [ 285.400605] ? do_syscall_64+0x71/0x90 [ 285.401124] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 285.401807] RIP: 0033:0x495726 [ 285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09 [ 285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726 [ 285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000 [ 285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c [ 285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634 [ 285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000 [ 285.410403] [ 285.410704] [ 285.410929] Allocated by task 144: [ 285.411402] kasan_save_stack+0x1e/0x40 [ 285.411926] kasan_set_track+0x21/0x30 [ 285.412442] __kasan_slab_alloc+0x55/0x70 [ 285.412973] kmem_cache_alloc_node+0x187/0x3d0 [ 285.413567] __alloc_skb+0x1b4/0x230 [ 285.414060] __ip_append_data+0x17f7/0x1b60 [ 285.414633] ip_append_data+0x97/0xf0 [ 285.415144] raw_sendmsg+0x5a8/0x12d0 [ 285.415640] sock_sendmsg+0xcc/0xe0 [ 285.416117] __sys_sendto+0x190/0x230 [ 285.416626] __x64_sys_sendto+0x74/0x90 [ 285.417145] do_syscall_64+0x64/0x90 [ 285.417624] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 285.418306] [ 285.418531] Freed by task 144: [ 285.418960] kasan_save_stack+0x1e/0x40 [ 285.419469] kasan_set_track+0x21/0x30 [ 285.419988] kasan_save_free_info+0x27/0x40 [ 285.420556] ____kasan_slab_free+0x109/0x1a0 [ 285.421146] kmem_cache_free+0x1c2/0x450 [ 285.421680] __netif_receive_skb_core+0x2ce/0x1870 [ 285.422333] __netif_receive_skb_one_core+0x97/0x140 [ 285.423003] process_backlog+0x100/0x2f0 [ 285.423537] __napi_poll+0x5c/0x2d0 [ 285.424023] net_rx_action+0x2be/0x560 [ 285.424510] __do_softirq+0x11b/0x3de [ 285.425034] [ 285.425254] The buggy address belongs to the object at ffff8880bad31280 [ 285.425254] which belongs to the cache skbuff_head_cache of size 224 [ 285.426993] The buggy address is located 40 bytes inside of [ 285.426993] freed 224-byte region [ffff8880bad31280, ffff8880bad31360) [ 285.428572] [ 285.428798] The buggy address belongs to the physical page: [ 285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31 [ 285.430758] flags: 0x100000000000200(slab|node=0|zone=1) [ 285.431447] page_type: 0xffffffff() [ 285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000 [ 285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 [ 285.433562] page dumped because: kasan: bad access detected [ 285.434144] [ 285.434320] Memory state around the buggy address: [ 285.434828] ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 285.435580] ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 285.436777] ^ [ 285.437106] ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 285.437616] ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 285.438126] ================================================================== [ 285.438662] Disabling lock debugging due to kernel taint Fix this by: 1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a function compatible with non-work-conserving qdiscs 2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: valis Signed-off-by: valis Signed-off-by: Jamal Hadi Salim Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni --- net/sched/sch_plug.c | 2 +- net/sched/sch_qfq.c | 22 +++++++++++++++++----- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/net/sched/sch_plug.c b/net/sched/sch_plug.c index ea8c4a7174bb..35f49edf63db 100644 --- a/net/sched/sch_plug.c +++ b/net/sched/sch_plug.c @@ -207,7 +207,7 @@ static struct Qdisc_ops plug_qdisc_ops __read_mostly = { .priv_size = sizeof(struct plug_sched_data), .enqueue = plug_enqueue, .dequeue = plug_dequeue, - .peek = qdisc_peek_head, + .peek = qdisc_peek_dequeued, .init = plug_init, .change = plug_change, .reset = qdisc_reset_queue, diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c index 1a25752f1a9a..546c10adcacd 100644 --- a/net/sched/sch_qfq.c +++ b/net/sched/sch_qfq.c @@ -974,10 +974,13 @@ static void qfq_update_eligible(struct qfq_sched *q) } /* Dequeue head packet of the head class in the DRR queue of the aggregate. */ -static void agg_dequeue(struct qfq_aggregate *agg, - struct qfq_class *cl, unsigned int len) +static struct sk_buff *agg_dequeue(struct qfq_aggregate *agg, + struct qfq_class *cl, unsigned int len) { - qdisc_dequeue_peeked(cl->qdisc); + struct sk_buff *skb = qdisc_dequeue_peeked(cl->qdisc); + + if (!skb) + return NULL; cl->deficit -= (int) len; @@ -987,6 +990,8 @@ static void agg_dequeue(struct qfq_aggregate *agg, cl->deficit += agg->lmax; list_move_tail(&cl->alist, &agg->active); } + + return skb; } static inline struct sk_buff *qfq_peek_skb(struct qfq_aggregate *agg, @@ -1132,11 +1137,18 @@ static struct sk_buff *qfq_dequeue(struct Qdisc *sch) if (!skb) return NULL; - qdisc_qstats_backlog_dec(sch, skb); sch->q.qlen--; + + skb = agg_dequeue(in_serv_agg, cl, len); + + if (!skb) { + sch->q.qlen++; + return NULL; + } + + qdisc_qstats_backlog_dec(sch, skb); qdisc_bstats_update(sch, skb); - agg_dequeue(in_serv_agg, cl, len); /* If lmax is lowered, through qfq_change_class, for a class * owning pending packets with larger size than the new value * of lmax, then the following condition may hold. From 6ad40b36cd3b04209e2d6c89d252c873d8082a59 Mon Sep 17 00:00:00 2001 From: Shigeru Yoshida Date: Sun, 3 Sep 2023 02:07:08 +0900 Subject: [PATCH 58/95] kcm: Destroy mutex in kcm_exit_net() kcm_exit_net() should call mutex_destroy() on knet->mutex. This is especially needed if CONFIG_DEBUG_MUTEXES is enabled. Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Signed-off-by: Shigeru Yoshida Link: https://lore.kernel.org/r/20230902170708.1727999-1-syoshida@redhat.com Signed-off-by: Paolo Abeni --- net/kcm/kcmsock.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c index 393f01b2a7e6..4580f61426bb 100644 --- a/net/kcm/kcmsock.c +++ b/net/kcm/kcmsock.c @@ -1859,6 +1859,8 @@ static __net_exit void kcm_exit_net(struct net *net) * that all multiplexors and psocks have been destroyed. */ WARN_ON(!list_empty(&knet->mux_list)); + + mutex_destroy(&knet->mutex); } static struct pernet_operations kcm_net_ops = { From d3287e4038ca4f81e02067ab72d087af7224c68b Mon Sep 17 00:00:00 2001 From: Sabrina Dubroca Date: Mon, 4 Sep 2023 10:56:04 +0200 Subject: [PATCH 59/95] Revert "net: macsec: preserve ingress frame ordering" This reverts commit ab046a5d4be4c90a3952a0eae75617b49c0cb01b. It was trying to work around an issue at the crypto layer by excluding ASYNC implementations of gcm(aes), because a bug in the AESNI version caused reordering when some requests bypassed the cryptd queue while older requests were still pending on the queue. This was fixed by commit 38b2f68b4264 ("crypto: aesni - Fix cryptd reordering problem on gcm"), which pre-dates ab046a5d4be4. Herbert Xu confirmed that all ASYNC implementations are expected to maintain the ordering of completions wrt requests, so we can use them in MACsec. On my test machine, this restores the performance of a single netperf instance, from 1.4Gbps to 4.4Gbps. Link: https://lore.kernel.org/netdev/9328d206c5d9f9239cae27e62e74de40b258471d.1692279161.git.sd@queasysnail.net/T/ Link: https://lore.kernel.org/netdev/1b0cec71-d084-8153-2ba4-72ce71abeb65@byu.edu/ Link: https://lore.kernel.org/netdev/d335ddaa-18dc-f9f0-17ee-9783d3b2ca29@mailbox.tu-dresden.de/ Fixes: ab046a5d4be4 ("net: macsec: preserve ingress frame ordering") Signed-off-by: Sabrina Dubroca Link: https://lore.kernel.org/r/11c952469d114db6fb29242e1d9545e61f52f512.1693757159.git.sd@queasysnail.net Signed-off-by: Paolo Abeni --- drivers/net/macsec.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index c3f30663070f..b7e151439c48 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -1330,8 +1330,7 @@ static struct crypto_aead *macsec_alloc_tfm(char *key, int key_len, int icv_len) struct crypto_aead *tfm; int ret; - /* Pick a sync gcm(aes) cipher to ensure order is preserved. */ - tfm = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC); + tfm = crypto_alloc_aead("gcm(aes)", 0, 0); if (IS_ERR(tfm)) return tfm; From c3b704d4a4a265660e665df51b129e8425216ed1 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 5 Sep 2023 04:23:38 +0000 Subject: [PATCH 60/95] igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU This is a follow up of commit 915d975b2ffa ("net: deal with integer overflows in kmalloc_reserve()") based on David Laight feedback. Back in 2010, I failed to realize malicious users could set dev->mtu to arbitrary values. This mtu has been since limited to 0x7fffffff but regardless of how big dev->mtu is, it makes no sense for igmpv3_newpack() to allocate more than IP_MAX_MTU and risk various skb fields overflows. Fixes: 57e1ab6eaddc ("igmp: refine skb allocations") Link: https://lore.kernel.org/netdev/d273628df80f45428e739274ab9ecb72@AcuMS.aculab.com/ Signed-off-by: Eric Dumazet Reported-by: David Laight Cc: Kyle Zeng Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- net/ipv4/igmp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index 0c9e768e5628..418e5fb58fd3 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -353,8 +353,9 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, unsigned int mtu) struct flowi4 fl4; int hlen = LL_RESERVED_SPACE(dev); int tlen = dev->needed_tailroom; - unsigned int size = mtu; + unsigned int size; + size = min(mtu, IP_MAX_MTU); while (1) { skb = alloc_skb(size + hlen + tlen, GFP_ATOMIC | __GFP_NOWARN); From 29fe7a1b62717d58f033009874554d99d71f7d37 Mon Sep 17 00:00:00 2001 From: Geetha sowjanya Date: Tue, 5 Sep 2023 12:18:16 +0530 Subject: [PATCH 61/95] octeontx2-af: Fix truncation of smq in CN10K NIX AQ enqueue mbox handler The smq value used in the CN10K NIX AQ instruction enqueue mailbox handler was truncated to 9-bit value from 10-bit value because of typecasting the CN10K mbox request structure to the CN9K structure. Though this hasn't caused any problems when programming the NIX SQ context to the HW because the context structure is the same size. However, this causes a problem when accessing the structure parameters. This patch reads the right smq value for each platform. Fixes: 30077d210c83 ("octeontx2-af: cn10k: Update NIX/NPA context structure") Signed-off-by: Geetha sowjanya Signed-off-by: Sunil Kovvuri Goutham Signed-off-by: David S. Miller --- .../ethernet/marvell/octeontx2/af/rvu_nix.c | 21 +++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c index c2f68678e947..23c2f2ed2fb8 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c @@ -846,6 +846,21 @@ static int nix_aq_enqueue_wait(struct rvu *rvu, struct rvu_block *block, return 0; } +static void nix_get_aq_req_smq(struct rvu *rvu, struct nix_aq_enq_req *req, + u16 *smq, u16 *smq_mask) +{ + struct nix_cn10k_aq_enq_req *aq_req; + + if (!is_rvu_otx2(rvu)) { + aq_req = (struct nix_cn10k_aq_enq_req *)req; + *smq = aq_req->sq.smq; + *smq_mask = aq_req->sq_mask.smq; + } else { + *smq = req->sq.smq; + *smq_mask = req->sq_mask.smq; + } +} + static int rvu_nix_blk_aq_enq_inst(struct rvu *rvu, struct nix_hw *nix_hw, struct nix_aq_enq_req *req, struct nix_aq_enq_rsp *rsp) @@ -857,6 +872,7 @@ static int rvu_nix_blk_aq_enq_inst(struct rvu *rvu, struct nix_hw *nix_hw, struct rvu_block *block; struct admin_queue *aq; struct rvu_pfvf *pfvf; + u16 smq, smq_mask; void *ctx, *mask; bool ena; u64 cfg; @@ -928,13 +944,14 @@ static int rvu_nix_blk_aq_enq_inst(struct rvu *rvu, struct nix_hw *nix_hw, if (rc) return rc; + nix_get_aq_req_smq(rvu, req, &smq, &smq_mask); /* Check if SQ pointed SMQ belongs to this PF/VF or not */ if (req->ctype == NIX_AQ_CTYPE_SQ && ((req->op == NIX_AQ_INSTOP_INIT && req->sq.ena) || (req->op == NIX_AQ_INSTOP_WRITE && - req->sq_mask.ena && req->sq_mask.smq && req->sq.ena))) { + req->sq_mask.ena && req->sq.ena && smq_mask))) { if (!is_valid_txschq(rvu, blkaddr, NIX_TXSCH_LVL_SMQ, - pcifunc, req->sq.smq)) + pcifunc, smq)) return NIX_AF_ERR_AQ_ENQUEUE; } From 5aa48279712e1f134aac908acde4df798955a955 Mon Sep 17 00:00:00 2001 From: Olga Zaborska Date: Tue, 25 Jul 2023 10:10:56 +0200 Subject: [PATCH 62/95] igc: Change IGC_MIN to allow set rx/tx value between 64 and 80 Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx value between 64 and 80. All igc devices can use as low as 64 descriptors. This change will unify igc with other drivers. Based on commit 7b1be1987c1e ("e1000e: lower ring minimum size to 64") Fixes: 0507ef8a0372 ("igc: Add transmit and receive fastpath and interrupt handlers") Signed-off-by: Olga Zaborska Tested-by: Naama Meir Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/igc/igc.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h index 8ebe6999a528..f48f82d5e274 100644 --- a/drivers/net/ethernet/intel/igc/igc.h +++ b/drivers/net/ethernet/intel/igc/igc.h @@ -379,11 +379,11 @@ static inline u32 igc_rss_type(const union igc_adv_rx_desc *rx_desc) /* TX/RX descriptor defines */ #define IGC_DEFAULT_TXD 256 #define IGC_DEFAULT_TX_WORK 128 -#define IGC_MIN_TXD 80 +#define IGC_MIN_TXD 64 #define IGC_MAX_TXD 4096 #define IGC_DEFAULT_RXD 256 -#define IGC_MIN_RXD 80 +#define IGC_MIN_RXD 64 #define IGC_MAX_RXD 4096 /* Supported Rx Buffer Sizes */ From 8360717524a24a421c36ef8eb512406dbd42160a Mon Sep 17 00:00:00 2001 From: Olga Zaborska Date: Tue, 25 Jul 2023 10:10:57 +0200 Subject: [PATCH 63/95] igbvf: Change IGBVF_MIN to allow set rx/tx value between 64 and 80 Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx value between 64 and 80. All igbvf devices can use as low as 64 descriptors. This change will unify igbvf with other drivers. Based on commit 7b1be1987c1e ("e1000e: lower ring minimum size to 64") Fixes: d4e0fe01a38a ("igbvf: add new driver to support 82576 virtual functions") Signed-off-by: Olga Zaborska Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/igbvf/igbvf.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/igbvf/igbvf.h b/drivers/net/ethernet/intel/igbvf/igbvf.h index 57d39ee00b58..7b83678ba83a 100644 --- a/drivers/net/ethernet/intel/igbvf/igbvf.h +++ b/drivers/net/ethernet/intel/igbvf/igbvf.h @@ -39,11 +39,11 @@ enum latency_range { /* Tx/Rx descriptor defines */ #define IGBVF_DEFAULT_TXD 256 #define IGBVF_MAX_TXD 4096 -#define IGBVF_MIN_TXD 80 +#define IGBVF_MIN_TXD 64 #define IGBVF_DEFAULT_RXD 256 #define IGBVF_MAX_RXD 4096 -#define IGBVF_MIN_RXD 80 +#define IGBVF_MIN_RXD 64 #define IGBVF_MIN_ITR_USECS 10 /* 100000 irq/sec */ #define IGBVF_MAX_ITR_USECS 10000 /* 100 irq/sec */ From 6319685bdc8ad5310890add907b7c42f89302886 Mon Sep 17 00:00:00 2001 From: Olga Zaborska Date: Tue, 25 Jul 2023 10:10:58 +0200 Subject: [PATCH 64/95] igb: Change IGB_MIN to allow set rx/tx value between 64 and 80 Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx value between 64 and 80. All igb devices can use as low as 64 descriptors. This change will unify igb with other drivers. Based on commit 7b1be1987c1e ("e1000e: lower ring minimum size to 64") Fixes: 9d5c824399de ("igb: PCI-Express 82575 Gigabit Ethernet driver") Signed-off-by: Olga Zaborska Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/igb/igb.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 015b78144114..a2b759531cb7 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -34,11 +34,11 @@ struct igb_adapter; /* TX/RX descriptor defines */ #define IGB_DEFAULT_TXD 256 #define IGB_DEFAULT_TX_WORK 128 -#define IGB_MIN_TXD 80 +#define IGB_MIN_TXD 64 #define IGB_MAX_TXD 4096 #define IGB_DEFAULT_RXD 256 -#define IGB_MIN_RXD 80 +#define IGB_MIN_RXD 64 #define IGB_MAX_RXD 4096 #define IGB_DEFAULT_ITR 3 /* dynamic */ From a5e2151ff9d5852d0ababbbcaeebd9646af9c8d9 Mon Sep 17 00:00:00 2001 From: Quan Tian Date: Tue, 5 Sep 2023 10:36:10 +0000 Subject: [PATCH 65/95] net/ipv6: SKB symmetric hash should incorporate transport ports __skb_get_hash_symmetric() was added to compute a symmetric hash over the protocol, addresses and transport ports, by commit eb70db875671 ("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate IPv4 addresses, IPv6 addresses and ports. However, it should not specify the flag as FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL, which stops further dissection when an IPv6 flow label is encountered, making transport ports not being incorporated in such case. As a consequence, the symmetric hash is based on 5-tuple for IPv4 but 3-tuple for IPv6 when flow label is present. It caused a few problems, e.g. when nft symhash and openvswitch l4_sym rely on the symmetric hash to perform load balancing as different L4 flows between two given IPv6 addresses would always get the same symmetric hash, leading to uneven traffic distribution. Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently. Fixes: eb70db875671 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.") Reported-by: Lars Ekman Closes: https://github.com/antrea-io/antrea/issues/5457 Signed-off-by: Quan Tian Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/flow_dissector.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 89d15ceaf9af..b3b3af0e7844 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -1831,8 +1831,7 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb) memset(&keys, 0, sizeof(keys)); __skb_flow_dissect(NULL, skb, &flow_keys_dissector_symmetric, - &keys, NULL, 0, 0, 0, - FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); + &keys, NULL, 0, 0, 0, 0); return __flow_hash_from_keys(&keys, &hashrnd); } From 39285e124edbc752331e98ace37cc141a6a3747a Mon Sep 17 00:00:00 2001 From: Taehee Yoo Date: Tue, 5 Sep 2023 08:46:10 +0000 Subject: [PATCH 66/95] net: team: do not use dynamic lockdep key team interface has used a dynamic lockdep key to avoid false-positive lockdep deadlock detection. Virtual interfaces such as team usually have their own lock for protecting private data. These interfaces can be nested. team0 | team1 Each interface's lock is actually different(team0->lock and team1->lock). So, mutex_lock(&team0->lock); mutex_lock(&team1->lock); mutex_unlock(&team1->lock); mutex_unlock(&team0->lock); The above case is absolutely safe. But lockdep warns about deadlock. Because the lockdep understands these two locks are same. This is a false-positive lockdep warning. So, in order to avoid this problem, the team interfaces started to use dynamic lockdep key. The false-positive problem was fixed, but it introduced a new problem. When the new team virtual interface is created, it registers a dynamic lockdep key(creates dynamic lockdep key) and uses it. But there is the limitation of the number of lockdep keys. So, If so many team interfaces are created, it consumes all lockdep keys. Then, the lockdep stops to work and warns about it. In order to fix this problem, team interfaces use the subclass instead of the dynamic key. So, when a new team interface is created, it doesn't register(create) a new lockdep, but uses existed subclass key instead. It is already used by the bonding interface for a similar case. As the bonding interface does, the subclass variable is the same as the 'dev->nested_level'. This variable indicates the depth in the stacked interface graph. The 'dev->nested_level' is protected by RTNL and RCU. So, 'mutex_lock_nested()' for 'team->lock' requires RTNL or RCU. In the current code, 'team->lock' is usually acquired under RTNL, there is no problem with using 'dev->nested_level'. The 'team_nl_team_get()' and The 'lb_stats_refresh()' functions acquire 'team->lock' without RTNL. But these don't iterate their own ports nested so they don't need nested lock. Reproducer: for i in {0..1000} do ip link add team$i type team ip link add dummy$i master team$i type dummy ip link set dummy$i up ip link set team$i up done Splat looks like: BUG: MAX_LOCKDEP_ENTRIES too low! turning off the locking correctness validator. Please attach the output of /proc/lock_stat to the bug report CPU: 0 PID: 4104 Comm: ip Not tainted 6.5.0-rc7+ #45 Call Trace: dump_stack_lvl+0x64/0xb0 add_lock_to_list+0x30d/0x5e0 check_prev_add+0x73a/0x23a0 ... sock_def_readable+0xfe/0x4f0 netlink_broadcast+0x76b/0xac0 nlmsg_notify+0x69/0x1d0 dev_open+0xed/0x130 ... Reported-by: syzbot+9bbbacfbf1e04d5221f7@syzkaller.appspotmail.com Fixes: 369f61bee0f5 ("team: fix nested locking lockdep warning") Signed-off-by: Taehee Yoo Signed-off-by: David S. Miller --- drivers/net/team/team.c | 113 +++++++++++------------ drivers/net/team/team_mode_loadbalance.c | 4 +- include/linux/if_team.h | 30 +++++- 3 files changed, 86 insertions(+), 61 deletions(-) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index e8b94580194e..ad29122a5468 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -1135,8 +1135,8 @@ static int team_port_add(struct team *team, struct net_device *port_dev, struct netlink_ext_ack *extack) { struct net_device *dev = team->dev; - struct team_port *port; char *portname = port_dev->name; + struct team_port *port; int err; if (port_dev->flags & IFF_LOOPBACK) { @@ -1203,13 +1203,6 @@ static int team_port_add(struct team *team, struct net_device *port_dev, memcpy(port->orig.dev_addr, port_dev->dev_addr, port_dev->addr_len); - err = team_port_enter(team, port); - if (err) { - netdev_err(dev, "Device %s failed to enter team mode\n", - portname); - goto err_port_enter; - } - err = dev_open(port_dev, extack); if (err) { netdev_dbg(dev, "Device %s opening failed\n", @@ -1217,6 +1210,26 @@ static int team_port_add(struct team *team, struct net_device *port_dev, goto err_dev_open; } + err = team_upper_dev_link(team, port, extack); + if (err) { + netdev_err(dev, "Device %s failed to set upper link\n", + portname); + goto err_set_upper_link; + } + + /* lockdep subclass variable(dev->nested_level) was updated by + * team_upper_dev_link(). + */ + team_unlock(team); + team_lock(team); + + err = team_port_enter(team, port); + if (err) { + netdev_err(dev, "Device %s failed to enter team mode\n", + portname); + goto err_port_enter; + } + err = vlan_vids_add_by_dev(port_dev, dev); if (err) { netdev_err(dev, "Failed to add vlan ids to device %s\n", @@ -1242,13 +1255,6 @@ static int team_port_add(struct team *team, struct net_device *port_dev, goto err_handler_register; } - err = team_upper_dev_link(team, port, extack); - if (err) { - netdev_err(dev, "Device %s failed to set upper link\n", - portname); - goto err_set_upper_link; - } - err = __team_option_inst_add_port(team, port); if (err) { netdev_err(dev, "Device %s failed to add per-port options\n", @@ -1295,9 +1301,6 @@ err_set_slave_promisc: __team_option_inst_del_port(team, port); err_option_port_add: - team_upper_dev_unlink(team, port); - -err_set_upper_link: netdev_rx_handler_unregister(port_dev); err_handler_register: @@ -1307,13 +1310,16 @@ err_enable_netpoll: vlan_vids_del_by_dev(port_dev, dev); err_vids_add: + team_port_leave(team, port); + +err_port_enter: + team_upper_dev_unlink(team, port); + +err_set_upper_link: dev_close(port_dev); err_dev_open: - team_port_leave(team, port); team_port_set_orig_dev_addr(port); - -err_port_enter: dev_set_mtu(port_dev, port->orig.mtu); err_set_mtu: @@ -1616,6 +1622,7 @@ static int team_init(struct net_device *dev) int err; team->dev = dev; + mutex_init(&team->lock); team_set_no_mode(team); team->notifier_ctx = false; @@ -1643,8 +1650,6 @@ static int team_init(struct net_device *dev) goto err_options_register; netif_carrier_off(dev); - lockdep_register_key(&team->team_lock_key); - __mutex_init(&team->lock, "team->team_lock_key", &team->team_lock_key); netdev_lockdep_set_classes(dev); return 0; @@ -1665,7 +1670,7 @@ static void team_uninit(struct net_device *dev) struct team_port *port; struct team_port *tmp; - mutex_lock(&team->lock); + team_lock(team); list_for_each_entry_safe(port, tmp, &team->port_list, list) team_port_del(team, port->dev); @@ -1674,9 +1679,8 @@ static void team_uninit(struct net_device *dev) team_mcast_rejoin_fini(team); team_notify_peers_fini(team); team_queue_override_fini(team); - mutex_unlock(&team->lock); + team_unlock(team); netdev_change_features(dev); - lockdep_unregister_key(&team->team_lock_key); } static void team_destructor(struct net_device *dev) @@ -1790,18 +1794,18 @@ static void team_set_rx_mode(struct net_device *dev) static int team_set_mac_address(struct net_device *dev, void *p) { - struct sockaddr *addr = p; struct team *team = netdev_priv(dev); + struct sockaddr *addr = p; struct team_port *port; if (dev->type == ARPHRD_ETHER && !is_valid_ether_addr(addr->sa_data)) return -EADDRNOTAVAIL; dev_addr_set(dev, addr->sa_data); - mutex_lock(&team->lock); + team_lock(team); list_for_each_entry(port, &team->port_list, list) if (team->ops.port_change_dev_addr) team->ops.port_change_dev_addr(team, port); - mutex_unlock(&team->lock); + team_unlock(team); return 0; } @@ -1815,7 +1819,7 @@ static int team_change_mtu(struct net_device *dev, int new_mtu) * Alhough this is reader, it's guarded by team lock. It's not possible * to traverse list in reverse under rcu_read_lock */ - mutex_lock(&team->lock); + team_lock(team); team->port_mtu_change_allowed = true; list_for_each_entry(port, &team->port_list, list) { err = dev_set_mtu(port->dev, new_mtu); @@ -1826,7 +1830,7 @@ static int team_change_mtu(struct net_device *dev, int new_mtu) } } team->port_mtu_change_allowed = false; - mutex_unlock(&team->lock); + team_unlock(team); dev->mtu = new_mtu; @@ -1836,7 +1840,7 @@ unwind: list_for_each_entry_continue_reverse(port, &team->port_list, list) dev_set_mtu(port->dev, dev->mtu); team->port_mtu_change_allowed = false; - mutex_unlock(&team->lock); + team_unlock(team); return err; } @@ -1890,20 +1894,20 @@ static int team_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) * Alhough this is reader, it's guarded by team lock. It's not possible * to traverse list in reverse under rcu_read_lock */ - mutex_lock(&team->lock); + team_lock(team); list_for_each_entry(port, &team->port_list, list) { err = vlan_vid_add(port->dev, proto, vid); if (err) goto unwind; } - mutex_unlock(&team->lock); + team_unlock(team); return 0; unwind: list_for_each_entry_continue_reverse(port, &team->port_list, list) vlan_vid_del(port->dev, proto, vid); - mutex_unlock(&team->lock); + team_unlock(team); return err; } @@ -1913,10 +1917,10 @@ static int team_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, u16 vid) struct team *team = netdev_priv(dev); struct team_port *port; - mutex_lock(&team->lock); + team_lock(team); list_for_each_entry(port, &team->port_list, list) vlan_vid_del(port->dev, proto, vid); - mutex_unlock(&team->lock); + team_unlock(team); return 0; } @@ -1938,9 +1942,9 @@ static void team_netpoll_cleanup(struct net_device *dev) { struct team *team = netdev_priv(dev); - mutex_lock(&team->lock); + team_lock(team); __team_netpoll_cleanup(team); - mutex_unlock(&team->lock); + team_unlock(team); } static int team_netpoll_setup(struct net_device *dev, @@ -1950,7 +1954,7 @@ static int team_netpoll_setup(struct net_device *dev, struct team_port *port; int err = 0; - mutex_lock(&team->lock); + team_lock(team); list_for_each_entry(port, &team->port_list, list) { err = __team_port_enable_netpoll(port); if (err) { @@ -1958,7 +1962,7 @@ static int team_netpoll_setup(struct net_device *dev, break; } } - mutex_unlock(&team->lock); + team_unlock(team); return err; } #endif @@ -1969,9 +1973,9 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev, struct team *team = netdev_priv(dev); int err; - mutex_lock(&team->lock); + team_lock(team); err = team_port_add(team, port_dev, extack); - mutex_unlock(&team->lock); + team_unlock(team); if (!err) netdev_change_features(dev); @@ -1984,19 +1988,12 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev) struct team *team = netdev_priv(dev); int err; - mutex_lock(&team->lock); + team_lock(team); err = team_port_del(team, port_dev); - mutex_unlock(&team->lock); + team_unlock(team); - if (err) - return err; - - if (netif_is_team_master(port_dev)) { - lockdep_unregister_key(&team->team_lock_key); - lockdep_register_key(&team->team_lock_key); - lockdep_set_class(&team->lock, &team->team_lock_key); - } - netdev_change_features(dev); + if (!err) + netdev_change_features(dev); return err; } @@ -2316,13 +2313,13 @@ static struct team *team_nl_team_get(struct genl_info *info) } team = netdev_priv(dev); - mutex_lock(&team->lock); + __team_lock(team); return team; } static void team_nl_team_put(struct team *team) { - mutex_unlock(&team->lock); + team_unlock(team); dev_put(team->dev); } @@ -2984,9 +2981,9 @@ static void team_port_change_check(struct team_port *port, bool linkup) { struct team *team = port->team; - mutex_lock(&team->lock); + team_lock(team); __team_port_change_check(port, linkup); - mutex_unlock(&team->lock); + team_unlock(team); } diff --git a/drivers/net/team/team_mode_loadbalance.c b/drivers/net/team/team_mode_loadbalance.c index 00f8989c29c0..7bcc9d37447a 100644 --- a/drivers/net/team/team_mode_loadbalance.c +++ b/drivers/net/team/team_mode_loadbalance.c @@ -478,7 +478,7 @@ static void lb_stats_refresh(struct work_struct *work) team = lb_priv_ex->team; lb_priv = get_lb_priv(team); - if (!mutex_trylock(&team->lock)) { + if (!team_trylock(team)) { schedule_delayed_work(&lb_priv_ex->stats.refresh_dw, 0); return; } @@ -515,7 +515,7 @@ static void lb_stats_refresh(struct work_struct *work) schedule_delayed_work(&lb_priv_ex->stats.refresh_dw, (lb_priv_ex->stats.refresh_interval * HZ) / 10); - mutex_unlock(&team->lock); + team_unlock(team); } static void lb_stats_refresh_interval_get(struct team *team, diff --git a/include/linux/if_team.h b/include/linux/if_team.h index 1b9b15a492fa..12d4447fc8ab 100644 --- a/include/linux/if_team.h +++ b/include/linux/if_team.h @@ -221,10 +221,38 @@ struct team { atomic_t count_pending; struct delayed_work dw; } mcast_rejoin; - struct lock_class_key team_lock_key; long mode_priv[TEAM_MODE_PRIV_LONGS]; }; +static inline void __team_lock(struct team *team) +{ + mutex_lock(&team->lock); +} + +static inline int team_trylock(struct team *team) +{ + return mutex_trylock(&team->lock); +} + +#ifdef CONFIG_LOCKDEP +static inline void team_lock(struct team *team) +{ + ASSERT_RTNL(); + mutex_lock_nested(&team->lock, team->dev->nested_level); +} + +#else +static inline void team_lock(struct team *team) +{ + __team_lock(team); +} +#endif + +static inline void team_unlock(struct team *team) +{ + mutex_unlock(&team->lock); +} + static inline int team_dev_queue_xmit(struct team *team, struct team_port *port, struct sk_buff *skb) { From 9b271ebaf9a2c5c566a54bc6cd915962e8241130 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 5 Sep 2023 13:40:46 +0000 Subject: [PATCH 67/95] ip_tunnels: use DEV_STATS_INC() syzbot/KCSAN reported data-races in iptunnel_xmit_stats() [1] This can run from multiple cpus without mutual exclusion. Adopt SMP safe DEV_STATS_INC() to update dev->stats fields. [1] BUG: KCSAN: data-race in iptunnel_xmit / iptunnel_xmit read-write to 0xffff8881353df170 of 8 bytes by task 30263 on cpu 1: iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline] iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87 ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662 __netdev_start_xmit include/linux/netdevice.h:4889 [inline] netdev_start_xmit include/linux/netdevice.h:4903 [inline] xmit_one net/core/dev.c:3544 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560 __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340 dev_queue_xmit include/linux/netdevice.h:3082 [inline] __bpf_tx_skb net/core/filter.c:2129 [inline] __bpf_redirect_no_mac net/core/filter.c:2159 [inline] __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182 ____bpf_clone_redirect net/core/filter.c:2453 [inline] bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425 ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954 __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195 bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline] __bpf_prog_run include/linux/filter.h:609 [inline] bpf_prog_run include/linux/filter.h:616 [inline] bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423 bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045 bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996 __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353 __do_sys_bpf kernel/bpf/syscall.c:5439 [inline] __se_sys_bpf kernel/bpf/syscall.c:5437 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read-write to 0xffff8881353df170 of 8 bytes by task 30249 on cpu 0: iptunnel_xmit_stats include/net/ip_tunnels.h:493 [inline] iptunnel_xmit+0x432/0x4a0 net/ipv4/ip_tunnel_core.c:87 ip_tunnel_xmit+0x1477/0x1750 net/ipv4/ip_tunnel.c:831 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:662 __netdev_start_xmit include/linux/netdevice.h:4889 [inline] netdev_start_xmit include/linux/netdevice.h:4903 [inline] xmit_one net/core/dev.c:3544 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560 __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340 dev_queue_xmit include/linux/netdevice.h:3082 [inline] __bpf_tx_skb net/core/filter.c:2129 [inline] __bpf_redirect_no_mac net/core/filter.c:2159 [inline] __bpf_redirect+0x723/0x9c0 net/core/filter.c:2182 ____bpf_clone_redirect net/core/filter.c:2453 [inline] bpf_clone_redirect+0x16c/0x1d0 net/core/filter.c:2425 ___bpf_prog_run+0xd7d/0x41e0 kernel/bpf/core.c:1954 __bpf_prog_run512+0x74/0xa0 kernel/bpf/core.c:2195 bpf_dispatcher_nop_func include/linux/bpf.h:1181 [inline] __bpf_prog_run include/linux/filter.h:609 [inline] bpf_prog_run include/linux/filter.h:616 [inline] bpf_test_run+0x15d/0x3d0 net/bpf/test_run.c:423 bpf_prog_test_run_skb+0x77b/0xa00 net/bpf/test_run.c:1045 bpf_prog_test_run+0x265/0x3d0 kernel/bpf/syscall.c:3996 __sys_bpf+0x3af/0x780 kernel/bpf/syscall.c:5353 __do_sys_bpf kernel/bpf/syscall.c:5439 [inline] __se_sys_bpf kernel/bpf/syscall.c:5437 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5437 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000018830 -> 0x0000000000018831 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 30249 Comm: syz-executor.4 Not tainted 6.5.0-syzkaller-11704-g3f86ed6ec0b3 #0 Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Reported-by: syzbot Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- include/net/ip_tunnels.h | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index e8750b4ef7e1..f346b4efbc30 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -483,15 +483,14 @@ static inline void iptunnel_xmit_stats(struct net_device *dev, int pkt_len) u64_stats_inc(&tstats->tx_packets); u64_stats_update_end(&tstats->syncp); put_cpu_ptr(tstats); - } else { - struct net_device_stats *err_stats = &dev->stats; + return; + } - if (pkt_len < 0) { - err_stats->tx_errors++; - err_stats->tx_aborted_errors++; - } else { - err_stats->tx_dropped++; - } + if (pkt_len < 0) { + DEV_STATS_INC(dev, tx_errors); + DEV_STATS_INC(dev, tx_aborted_errors); + } else { + DEV_STATS_INC(dev, tx_dropped); } } From b7558a77529fef60e7992f40fb5353fed8be0cf8 Mon Sep 17 00:00:00 2001 From: Jianbo Liu Date: Tue, 5 Sep 2023 10:48:45 -0700 Subject: [PATCH 68/95] net/mlx5e: Clear mirred devices array if the rule is split In the cited commit, the mirred devices are recorded and checked while parsing the actions. In order to avoid system crash, the duplicate action in a single rule is not allowed. But the rule is actually break down into several FTEs in different tables, for either mirroring, or the specified types of actions which use post action infrastructure. It will reject certain action list by mistake, for example: actions:enp8s0f0_1,set(ipv4(ttl=63)),enp8s0f0_0,enp8s0f0_1. Here the rule is split to two FTEs because of pedit action. To fix this issue, when parsing the rule actions, reset if_count to clear the mirred devices array if the rule is split to multiple FTEs, and then the duplicate checking is restarted. Fixes: 554fe75c1b3f ("net/mlx5e: Avoid duplicating rule destinations") Signed-off-by: Jianbo Liu Reviewed-by: Vlad Buslov Signed-off-by: Saeed Mahameed Signed-off-by: David S. Miller --- drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c | 4 +++- drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/mirred.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/pedit.c | 4 +++- .../ethernet/mellanox/mlx5/core/en/tc/act/redirect_ingress.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan.c | 1 + .../net/ethernet/mellanox/mlx5/core/en/tc/act/vlan_mangle.c | 4 +++- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 1 + 7 files changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c index 92d3952dfa8b..feeb41693c17 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c @@ -17,8 +17,10 @@ tc_act_parse_ct(struct mlx5e_tc_act_parse_state *parse_state, if (err) return err; - if (mlx5e_is_eswitch_flow(parse_state->flow)) + if (mlx5e_is_eswitch_flow(parse_state->flow)) { attr->esw_attr->split_count = attr->esw_attr->out_count; + parse_state->if_count = 0; + } attr->flags |= MLX5_ATTR_FLAG_CT; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/mirred.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/mirred.c index 291193f7120d..f63402c48028 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/mirred.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/mirred.c @@ -294,6 +294,7 @@ parse_mirred_ovs_master(struct mlx5e_tc_act_parse_state *parse_state, if (err) return err; + parse_state->if_count = 0; esw_attr->out_count++; return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/pedit.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/pedit.c index 3b272bbf4c53..368a95fa77d3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/pedit.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/pedit.c @@ -98,8 +98,10 @@ tc_act_parse_pedit(struct mlx5e_tc_act_parse_state *parse_state, attr->action |= MLX5_FLOW_CONTEXT_ACTION_MOD_HDR; - if (ns_type == MLX5_FLOW_NAMESPACE_FDB) + if (ns_type == MLX5_FLOW_NAMESPACE_FDB) { esw_attr->split_count = esw_attr->out_count; + parse_state->if_count = 0; + } return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/redirect_ingress.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/redirect_ingress.c index ad09a8a5f36e..2d1d4a04501b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/redirect_ingress.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/redirect_ingress.c @@ -66,6 +66,7 @@ tc_act_parse_redirect_ingress(struct mlx5e_tc_act_parse_state *parse_state, if (err) return err; + parse_state->if_count = 0; esw_attr->out_count++; return 0; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan.c index c8a3eaf189f6..a13c5e707b83 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan.c @@ -166,6 +166,7 @@ tc_act_parse_vlan(struct mlx5e_tc_act_parse_state *parse_state, return err; esw_attr->split_count = esw_attr->out_count; + parse_state->if_count = 0; return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan_mangle.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan_mangle.c index 310b99230760..f17575b09788 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan_mangle.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/vlan_mangle.c @@ -65,8 +65,10 @@ tc_act_parse_vlan_mangle(struct mlx5e_tc_act_parse_state *parse_state, if (err) return err; - if (ns_type == MLX5_FLOW_NAMESPACE_FDB) + if (ns_type == MLX5_FLOW_NAMESPACE_FDB) { attr->esw_attr->split_count = attr->esw_attr->out_count; + parse_state->if_count = 0; + } return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c index 318083690fcd..c24828b688ac 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c @@ -3936,6 +3936,7 @@ parse_tc_actions(struct mlx5e_tc_act_parse_state *parse_state, } i_split = i + 1; + parse_state->if_count = 0; list_add(&attr->list, &flow->attrs); } From 344134609a564f28b3cc81ca6650319ccd5d8961 Mon Sep 17 00:00:00 2001 From: Bodong Wang Date: Tue, 5 Sep 2023 10:48:46 -0700 Subject: [PATCH 69/95] mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode ACL flow table is required in switchdev mode when metadata is enabled, driver creates such table when loading each vport. However, not every vport is loaded in switchdev mode. Such as ECPF if it's the eswitch manager. In this case, ACL flow table is still needed. To make it modularized, create ACL flow table for eswitch manager as default and skip such operations when loading manager vport. Also, there is no need to load the eswitch manager vport in switchdev mode. This means there is no need to load it on regular connect-x HCAs where the PF is the eswitch manager. This will avoid creating duplicate ACL flow table for host PF vport. Fixes: 29bcb6e4fe70 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules") Fixes: eb8e9fae0a22 ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager") Fixes: 5019833d661f ("net/mlx5: E-switch, Introduce helper function to enable/disable vports") Signed-off-by: Bodong Wang Reviewed-by: Mark Bloch Signed-off-by: Saeed Mahameed Signed-off-by: David S. Miller --- .../net/ethernet/mellanox/mlx5/core/eswitch.c | 21 ++++++-- .../mellanox/mlx5/core/eswitch_offloads.c | 49 +++++++++++++------ 2 files changed, 51 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c index 6cd7d6497e10..d4cde6555063 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c @@ -1276,12 +1276,19 @@ int mlx5_eswitch_enable_pf_vf_vports(struct mlx5_eswitch *esw, enum mlx5_eswitch_vport_event enabled_events) { + bool pf_needed; int ret; + pf_needed = mlx5_core_is_ecpf_esw_manager(esw->dev) || + esw->mode == MLX5_ESWITCH_LEGACY; + /* Enable PF vport */ - ret = mlx5_eswitch_load_pf_vf_vport(esw, MLX5_VPORT_PF, enabled_events); - if (ret) - return ret; + if (pf_needed) { + ret = mlx5_eswitch_load_pf_vf_vport(esw, MLX5_VPORT_PF, + enabled_events); + if (ret) + return ret; + } /* Enable external host PF HCA */ ret = host_pf_enable_hca(esw->dev); @@ -1317,7 +1324,8 @@ ec_vf_err: ecpf_err: host_pf_disable_hca(esw->dev); pf_hca_err: - mlx5_eswitch_unload_pf_vf_vport(esw, MLX5_VPORT_PF); + if (pf_needed) + mlx5_eswitch_unload_pf_vf_vport(esw, MLX5_VPORT_PF); return ret; } @@ -1335,7 +1343,10 @@ void mlx5_eswitch_disable_pf_vf_vports(struct mlx5_eswitch *esw) } host_pf_disable_hca(esw->dev); - mlx5_eswitch_unload_pf_vf_vport(esw, MLX5_VPORT_PF); + + if (mlx5_core_is_ecpf_esw_manager(esw->dev) || + esw->mode == MLX5_ESWITCH_LEGACY) + mlx5_eswitch_unload_pf_vf_vport(esw, MLX5_VPORT_PF); } static void mlx5_eswitch_get_devlink_param(struct mlx5_eswitch *esw) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c index 752fb0dfb111..b296ac52a439 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c @@ -3216,26 +3216,47 @@ esw_vport_destroy_offloads_acl_tables(struct mlx5_eswitch *esw, esw_acl_ingress_ofld_cleanup(esw, vport); } -static int esw_create_uplink_offloads_acl_tables(struct mlx5_eswitch *esw) +static int esw_create_offloads_acl_tables(struct mlx5_eswitch *esw) { - struct mlx5_vport *vport; + struct mlx5_vport *uplink, *manager; + int ret; - vport = mlx5_eswitch_get_vport(esw, MLX5_VPORT_UPLINK); - if (IS_ERR(vport)) - return PTR_ERR(vport); + uplink = mlx5_eswitch_get_vport(esw, MLX5_VPORT_UPLINK); + if (IS_ERR(uplink)) + return PTR_ERR(uplink); - return esw_vport_create_offloads_acl_tables(esw, vport); + ret = esw_vport_create_offloads_acl_tables(esw, uplink); + if (ret) + return ret; + + manager = mlx5_eswitch_get_vport(esw, esw->manager_vport); + if (IS_ERR(manager)) { + ret = PTR_ERR(manager); + goto err_manager; + } + + ret = esw_vport_create_offloads_acl_tables(esw, manager); + if (ret) + goto err_manager; + + return 0; + +err_manager: + esw_vport_destroy_offloads_acl_tables(esw, uplink); + return ret; } -static void esw_destroy_uplink_offloads_acl_tables(struct mlx5_eswitch *esw) +static void esw_destroy_offloads_acl_tables(struct mlx5_eswitch *esw) { struct mlx5_vport *vport; - vport = mlx5_eswitch_get_vport(esw, MLX5_VPORT_UPLINK); - if (IS_ERR(vport)) - return; + vport = mlx5_eswitch_get_vport(esw, esw->manager_vport); + if (!IS_ERR(vport)) + esw_vport_destroy_offloads_acl_tables(esw, vport); - esw_vport_destroy_offloads_acl_tables(esw, vport); + vport = mlx5_eswitch_get_vport(esw, MLX5_VPORT_UPLINK); + if (!IS_ERR(vport)) + esw_vport_destroy_offloads_acl_tables(esw, vport); } int mlx5_eswitch_reload_reps(struct mlx5_eswitch *esw) @@ -3280,7 +3301,7 @@ static int esw_offloads_steering_init(struct mlx5_eswitch *esw) } esw->fdb_table.offloads.indir = indir; - err = esw_create_uplink_offloads_acl_tables(esw); + err = esw_create_offloads_acl_tables(esw); if (err) goto create_acl_err; @@ -3321,7 +3342,7 @@ create_fdb_err: create_restore_err: esw_destroy_offloads_table(esw); create_offloads_err: - esw_destroy_uplink_offloads_acl_tables(esw); + esw_destroy_offloads_acl_tables(esw); create_acl_err: mlx5_esw_indir_table_destroy(esw->fdb_table.offloads.indir); create_indir_err: @@ -3337,7 +3358,7 @@ static void esw_offloads_steering_cleanup(struct mlx5_eswitch *esw) esw_destroy_offloads_fdb_tables(esw); esw_destroy_restore_table(esw); esw_destroy_offloads_table(esw); - esw_destroy_uplink_offloads_acl_tables(esw); + esw_destroy_offloads_acl_tables(esw); mlx5_esw_indir_table_destroy(esw->fdb_table.offloads.indir); mutex_destroy(&esw->fdb_table.offloads.vports.lock); } From 954ad9bf13c4f95a4958b5f8433301f2ab99e1f5 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Wed, 6 Sep 2023 00:53:36 +0300 Subject: [PATCH 70/95] net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload More careful measurement of the tc-cbs bandwidth shows that the stream bandwidth (effectively idleslope) increases, there is a larger and larger discrepancy between the rate limit obtained by the software Qdisc, and the rate limit obtained by its offloaded counterpart. The discrepancy becomes so large, that e.g. at an idleslope of 40000 (40Mbps), the offloaded cbs does not actually rate limit anything, and traffic will pass at line rate through a 100 Mbps port. The reason for the discrepancy is that the hardware documentation I've been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which the credit counter is increased". Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf (for SJA1110), the wording is different: "This field specifies the value, in bytes per second times link speed, by which the credit counter is increased". So there's an extra scaling for link speed that the driver is currently not accounting for, and apparently (empirically), that link speed is expressed in Kbps. I've pondered whether to pollute the sja1105_mac_link_up() implementation with CBS shaper reprogramming, but I don't think it is worth it. IMO, the UAPI exposed by tc-cbs requires user space to recalculate the sendslope anyway, since the formula for that depends on port_transmit_rate (see man tc-cbs), which is not an invariant from tc's perspective. So we use the offload->sendslope and offload->idleslope to deduce the original port_transmit_rate from the CBS formula, and use that value to scale the offload->sendslope and offload->idleslope to values that the hardware understands. Some numerical data points: 40Mbps stream, max interfering frame size 1500, port speed 100M --------------------------------------------------------------- tc-cbs parameters: idleslope 40000 sendslope -60000 locredit -900 hicredit 600 which result in hardware values: Before (doesn't work) After (works) credit_hi 600 600 credit_lo 900 900 send_slope 7500000 75 idle_slope 5000000 50 40Mbps stream, max interfering frame size 1500, port speed 1G ------------------------------------------------------------- tc-cbs parameters: idleslope 40000 sendslope -960000 locredit -1440 hicredit 60 which result in hardware values: Before (doesn't work) After (works) credit_hi 60 60 credit_lo 1440 1440 send_slope 120000000 120 idle_slope 5000000 5 5.12Mbps stream, max interfering frame size 1522, port speed 100M ----------------------------------------------------------------- tc-cbs parameters: idleslope 5120 sendslope -94880 locredit -1444 hicredit 77 which result in hardware values: Before (doesn't work) After (works) credit_hi 77 77 credit_lo 1444 1444 send_slope 11860000 118 idle_slope 640000 6 Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps. Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Reported-by: Yanan Yang Signed-off-by: Vladimir Oltean Signed-off-by: David S. Miller --- drivers/net/dsa/sja1105/sja1105_main.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c index 331bb1c6676a..3f17c17ff636 100644 --- a/drivers/net/dsa/sja1105/sja1105_main.c +++ b/drivers/net/dsa/sja1105/sja1105_main.c @@ -2150,6 +2150,7 @@ static int sja1105_setup_tc_cbs(struct dsa_switch *ds, int port, { struct sja1105_private *priv = ds->priv; struct sja1105_cbs_entry *cbs; + s64 port_transmit_rate_kbps; int index; if (!offload->enable) @@ -2167,9 +2168,17 @@ static int sja1105_setup_tc_cbs(struct dsa_switch *ds, int port, */ cbs->credit_hi = offload->hicredit; cbs->credit_lo = abs(offload->locredit); - /* User space is in kbits/sec, hardware in bytes/sec */ - cbs->idle_slope = offload->idleslope * BYTES_PER_KBIT; - cbs->send_slope = abs(offload->sendslope * BYTES_PER_KBIT); + /* User space is in kbits/sec, while the hardware in bytes/sec times + * link speed. Since the given offload->sendslope is good only for the + * current link speed anyway, and user space is likely to reprogram it + * when that changes, don't even bother to track the port's link speed, + * but deduce the port transmit rate from idleslope - sendslope. + */ + port_transmit_rate_kbps = offload->idleslope - offload->sendslope; + cbs->idle_slope = div_s64(offload->idleslope * BYTES_PER_KBIT, + port_transmit_rate_kbps); + cbs->send_slope = div_s64(abs(offload->sendslope * BYTES_PER_KBIT), + port_transmit_rate_kbps); /* Convert the negative values from 64-bit 2's complement * to 32-bit 2's complement (for the case of 0x80000000 whose * negative is still negative). From 894cafc5c62ccced758077bd4e970dc714c42637 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Wed, 6 Sep 2023 00:53:37 +0300 Subject: [PATCH 71/95] net: dsa: sja1105: fix -ENOSPC when replacing the same tc-cbs too many times After running command [2] too many times in a row: [1] $ tc qdisc add dev sw2p0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0 [2] $ tc qdisc replace dev sw2p0 parent 1:1 cbs offload 1 \ idleslope 120000 sendslope -880000 locredit -1320 hicredit 180 (aka more than priv->info->num_cbs_shapers times) we start seeing the following error message: Error: Specified device failed to setup cbs hardware offload. This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents the same API for the qdisc create and replace cases, and the sja1105 driver fails to distinguish between the 2. Thus, it always thinks that it must allocate the same shaper for a {port, queue} pair, when it may instead have to replace an existing one. Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Signed-off-by: Vladimir Oltean Signed-off-by: David S. Miller --- drivers/net/dsa/sja1105/sja1105_main.c | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c index 3f17c17ff636..d7f57f223031 100644 --- a/drivers/net/dsa/sja1105/sja1105_main.c +++ b/drivers/net/dsa/sja1105/sja1105_main.c @@ -2116,6 +2116,18 @@ static void sja1105_bridge_leave(struct dsa_switch *ds, int port, #define BYTES_PER_KBIT (1000LL / 8) +static int sja1105_find_cbs_shaper(struct sja1105_private *priv, + int port, int prio) +{ + int i; + + for (i = 0; i < priv->info->num_cbs_shapers; i++) + if (priv->cbs[i].port == port && priv->cbs[i].prio == prio) + return i; + + return -1; +} + static int sja1105_find_unused_cbs_shaper(struct sja1105_private *priv) { int i; @@ -2156,9 +2168,14 @@ static int sja1105_setup_tc_cbs(struct dsa_switch *ds, int port, if (!offload->enable) return sja1105_delete_cbs_shaper(priv, port, offload->queue); - index = sja1105_find_unused_cbs_shaper(priv); - if (index < 0) - return -ENOSPC; + /* The user may be replacing an existing shaper */ + index = sja1105_find_cbs_shaper(priv, port, offload->queue); + if (index < 0) { + /* That isn't the case - see if we can allocate a new one */ + index = sja1105_find_unused_cbs_shaper(priv); + if (index < 0) + return -ENOSPC; + } cbs = &priv->cbs[index]; cbs->port = port; From 180a7419fe4adc8d9c8e0ef0fd17bcdd0cf78acd Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Wed, 6 Sep 2023 00:53:38 +0300 Subject: [PATCH 72/95] net: dsa: sja1105: complete tc-cbs offload support on SJA1110 The blamed commit left this delta behind: struct sja1105_cbs_entry { - u64 port; - u64 prio; + u64 port; /* Not used for SJA1110 */ + u64 prio; /* Not used for SJA1110 */ u64 credit_hi; u64 credit_lo; u64 send_slope; u64 idle_slope; }; but did not actually implement tc-cbs offload fully for the new switch. The offload is accepted, but it doesn't work. The difference compared to earlier switch generations is that now, the table of CBS shapers is sparse, because there are many more shapers, so the mapping between a {port, prio} and a table index is static, rather than requiring us to store the port and prio into the sja1105_cbs_entry. So, the problem is that the code programs the CBS shaper parameters at a dynamic table index which is incorrect. All that needs to be done for SJA1110 CBS shapers to work is to bypass the logic which allocates shapers in a dense manner, as for SJA1105, and use the fixed mapping instead. Fixes: 3e77e59bf8cf ("net: dsa: sja1105: add support for the SJA1110 switch family") Signed-off-by: Vladimir Oltean Signed-off-by: David S. Miller --- drivers/net/dsa/sja1105/sja1105.h | 2 ++ drivers/net/dsa/sja1105/sja1105_main.c | 13 +++++++++++++ drivers/net/dsa/sja1105/sja1105_spi.c | 4 ++++ 3 files changed, 19 insertions(+) diff --git a/drivers/net/dsa/sja1105/sja1105.h b/drivers/net/dsa/sja1105/sja1105.h index dee35ba924ad..0617d5ccd3ff 100644 --- a/drivers/net/dsa/sja1105/sja1105.h +++ b/drivers/net/dsa/sja1105/sja1105.h @@ -132,6 +132,8 @@ struct sja1105_info { int max_frame_mem; int num_ports; bool multiple_cascade_ports; + /* Every {port, TXQ} has its own CBS shaper */ + bool fixed_cbs_mapping; enum dsa_tag_protocol tag_proto; const struct sja1105_dynamic_table_ops *dyn_ops; const struct sja1105_table_ops *static_ops; diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c index d7f57f223031..a23d980d28f5 100644 --- a/drivers/net/dsa/sja1105/sja1105_main.c +++ b/drivers/net/dsa/sja1105/sja1105_main.c @@ -2115,12 +2115,22 @@ static void sja1105_bridge_leave(struct dsa_switch *ds, int port, } #define BYTES_PER_KBIT (1000LL / 8) +/* Port 0 (the uC port) does not have CBS shapers */ +#define SJA1110_FIXED_CBS(port, prio) ((((port) - 1) * SJA1105_NUM_TC) + (prio)) static int sja1105_find_cbs_shaper(struct sja1105_private *priv, int port, int prio) { int i; + if (priv->info->fixed_cbs_mapping) { + i = SJA1110_FIXED_CBS(port, prio); + if (i >= 0 && i < priv->info->num_cbs_shapers) + return i; + + return -1; + } + for (i = 0; i < priv->info->num_cbs_shapers; i++) if (priv->cbs[i].port == port && priv->cbs[i].prio == prio) return i; @@ -2132,6 +2142,9 @@ static int sja1105_find_unused_cbs_shaper(struct sja1105_private *priv) { int i; + if (priv->info->fixed_cbs_mapping) + return -1; + for (i = 0; i < priv->info->num_cbs_shapers; i++) if (!priv->cbs[i].idle_slope && !priv->cbs[i].send_slope) return i; diff --git a/drivers/net/dsa/sja1105/sja1105_spi.c b/drivers/net/dsa/sja1105/sja1105_spi.c index 5ce29c8057a4..834b5c1b4db0 100644 --- a/drivers/net/dsa/sja1105/sja1105_spi.c +++ b/drivers/net/dsa/sja1105/sja1105_spi.c @@ -781,6 +781,7 @@ const struct sja1105_info sja1110a_info = { .tag_proto = DSA_TAG_PROTO_SJA1110, .can_limit_mcast_flood = true, .multiple_cascade_ports = true, + .fixed_cbs_mapping = true, .ptp_ts_bits = 32, .ptpegr_ts_bytes = 8, .max_frame_mem = SJA1110_MAX_FRAME_MEMORY, @@ -831,6 +832,7 @@ const struct sja1105_info sja1110b_info = { .tag_proto = DSA_TAG_PROTO_SJA1110, .can_limit_mcast_flood = true, .multiple_cascade_ports = true, + .fixed_cbs_mapping = true, .ptp_ts_bits = 32, .ptpegr_ts_bytes = 8, .max_frame_mem = SJA1110_MAX_FRAME_MEMORY, @@ -881,6 +883,7 @@ const struct sja1105_info sja1110c_info = { .tag_proto = DSA_TAG_PROTO_SJA1110, .can_limit_mcast_flood = true, .multiple_cascade_ports = true, + .fixed_cbs_mapping = true, .ptp_ts_bits = 32, .ptpegr_ts_bytes = 8, .max_frame_mem = SJA1110_MAX_FRAME_MEMORY, @@ -931,6 +934,7 @@ const struct sja1105_info sja1110d_info = { .tag_proto = DSA_TAG_PROTO_SJA1110, .can_limit_mcast_flood = true, .multiple_cascade_ports = true, + .fixed_cbs_mapping = true, .ptp_ts_bits = 32, .ptpegr_ts_bytes = 8, .max_frame_mem = SJA1110_MAX_FRAME_MEMORY, From 1a961e74d5abbea049588a3d74b759955b4ed9d5 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Tue, 5 Sep 2023 16:42:02 -0700 Subject: [PATCH 73/95] net: phylink: fix sphinx complaint about invalid literal sphinx complains about the use of "%PHYLINK_PCS_NEG_*": Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string. Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string. These are not valid symbols so drop the '%' prefix. Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *) or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already a handful of un-adorned DEFINE_* in this file. Fixes: f99d471afa03 ("net: phylink: add PCS negotiation mode") Reported-by: Stephen Rothwell Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/ Signed-off-by: Jakub Kicinski Reviewed-by: Bagas Sanjaya Signed-off-by: David S. Miller --- include/linux/phylink.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/phylink.h b/include/linux/phylink.h index 7d07f8736431..2b886ea654bb 100644 --- a/include/linux/phylink.h +++ b/include/linux/phylink.h @@ -600,7 +600,7 @@ void pcs_get_state(struct phylink_pcs *pcs, * * The %neg_mode argument should be tested via the phylink_mode_*() family of * functions, or for PCS that set pcs->neg_mode true, should be tested - * against the %PHYLINK_PCS_NEG_* definitions. + * against the PHYLINK_PCS_NEG_* definitions. */ int pcs_config(struct phylink_pcs *pcs, unsigned int neg_mode, phy_interface_t interface, const unsigned long *advertising, @@ -630,7 +630,7 @@ void pcs_an_restart(struct phylink_pcs *pcs); * * The %mode argument should be tested via the phylink_mode_*() family of * functions, or for PCS that set pcs->neg_mode true, should be tested - * against the %PHYLINK_PCS_NEG_* definitions. + * against the PHYLINK_PCS_NEG_* definitions. */ void pcs_link_up(struct phylink_pcs *pcs, unsigned int neg_mode, phy_interface_t interface, int speed, int duplex); From 7645629f7dc88cd777f98970134bf1a54c8d77e3 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 30 Aug 2023 10:04:04 +0200 Subject: [PATCH 74/95] bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). If __bpf_prog_enter_sleepable_recur() detects recursion then it returns 0 without undoing rcu_read_lock_trace(), migrate_disable() or decrementing the recursion counter. This is fine in the JIT case because the JIT code will jump in the 0 case to the end and invoke the matching exit trampoline (__bpf_prog_exit_sleepable_recur()). This is not the case in kern_sys_bpf() which returns directly to the caller with an error code. Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case. Fixes: b1d18a7574d0d ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Daniel Borkmann Acked-by: Jiri Olsa Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de --- kernel/bpf/syscall.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ebeb0695305a..53a0b62464e9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -5505,6 +5505,7 @@ int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size) run_ctx.saved_run_ctx = NULL; if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) { /* recursion detected */ + __bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx); bpf_prog_put(prog); return -EBUSY; } From 6764e767f4af1e35f87f3497e1182d945de37f93 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 30 Aug 2023 10:04:05 +0200 Subject: [PATCH 75/95] bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before performing the recursion check which means in case of a recursion __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx value. __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx after the recursion check which means in case of a recursion __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx isn't initialized upfront. Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made. Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens a few cycles later. Fixes: e384c7b7b46d0 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack") Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Daniel Borkmann Acked-by: Jiri Olsa Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de --- kernel/bpf/syscall.c | 1 - kernel/bpf/trampoline.c | 5 ++--- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 53a0b62464e9..eb01c31ed591 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -5502,7 +5502,6 @@ int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size) } run_ctx.bpf_cookie = 0; - run_ctx.saved_run_ctx = NULL; if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) { /* recursion detected */ __bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 78acf28d4873..53ff50cac61e 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -926,13 +926,12 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog, migrate_disable(); might_fault(); + run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx); + if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) { bpf_prog_inc_misses_counter(prog); return 0; } - - run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx); - return bpf_prog_start_time(); } From a192103a11465e9d517975c50f9944dc80e44d61 Mon Sep 17 00:00:00 2001 From: Ilya Leoshkevich Date: Wed, 6 Sep 2023 02:44:19 +0200 Subject: [PATCH 76/95] s390/bpf: Pass through tail call counter in trampolines s390x eBPF programs use the following extension to the s390x calling convention: tail call counter is passed on stack at offset STK_OFF_TCCNT, which callees otherwise use as scratch space. Currently trampoline does not respect this and clobbers tail call counter. This breaks enforcing tail call limits in eBPF programs, which have trampolines attached to them. Fix by forwarding a copy of the tail call counter to the original eBPF program in the trampoline (for fexit), and by restoring it at the end of the trampoline (for fentry). Fixes: 528eb2cb87bc ("s390/bpf: Implement arch_prepare_bpf_trampoline()") Reported-by: Leon Hwang Signed-off-by: Ilya Leoshkevich Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com --- arch/s390/net/bpf_jit_comp.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 5e9371fbf3d5..de2fb12120d2 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -2088,6 +2088,7 @@ struct bpf_tramp_jit { */ int r14_off; /* Offset of saved %r14 */ int run_ctx_off; /* Offset of struct bpf_tramp_run_ctx */ + int tccnt_off; /* Offset of saved tailcall counter */ int do_fexit; /* do_fexit: label */ }; @@ -2258,12 +2259,16 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, tjit->r14_off = alloc_stack(tjit, sizeof(u64)); tjit->run_ctx_off = alloc_stack(tjit, sizeof(struct bpf_tramp_run_ctx)); + tjit->tccnt_off = alloc_stack(tjit, sizeof(u64)); /* The caller has already reserved STACK_FRAME_OVERHEAD bytes. */ tjit->stack_size -= STACK_FRAME_OVERHEAD; tjit->orig_stack_args_off = tjit->stack_size + STACK_FRAME_OVERHEAD; /* aghi %r15,-stack_size */ EMIT4_IMM(0xa70b0000, REG_15, -tjit->stack_size); + /* mvc tccnt_off(4,%r15),stack_size+STK_OFF_TCCNT(%r15) */ + _EMIT6(0xd203f000 | tjit->tccnt_off, + 0xf000 | (tjit->stack_size + STK_OFF_TCCNT)); /* stmg %r2,%rN,fwd_reg_args_off(%r15) */ if (nr_reg_args) EMIT6_DISP_LH(0xeb000000, 0x0024, REG_2, @@ -2400,6 +2405,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, (nr_stack_args * sizeof(u64) - 1) << 16 | tjit->stack_args_off, 0xf000 | tjit->orig_stack_args_off); + /* mvc STK_OFF_TCCNT(4,%r15),tccnt_off(%r15) */ + _EMIT6(0xd203f000 | STK_OFF_TCCNT, 0xf000 | tjit->tccnt_off); /* lgr %r1,%r8 */ EMIT4(0xb9040000, REG_1, REG_8); /* %r1() */ @@ -2456,6 +2463,9 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, if (flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET)) EMIT6_DISP_LH(0xe3000000, 0x0004, REG_2, REG_0, REG_15, tjit->retval_off); + /* mvc stack_size+STK_OFF_TCCNT(4,%r15),tccnt_off(%r15) */ + _EMIT6(0xd203f000 | (tjit->stack_size + STK_OFF_TCCNT), + 0xf000 | tjit->tccnt_off); /* aghi %r15,stack_size */ EMIT4_IMM(0xa70b0000, REG_15, tjit->stack_size); /* Emit an expoline for the following indirect jump. */ From a96a44aba556c42b432929d37d60158aca21ad4c Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 1 Sep 2023 16:11:27 -0700 Subject: [PATCH 77/95] bpf: bpf_sk_storage: Fix invalid wait context lockdep report './test_progs -t test_local_storage' reported a splat: [ 27.137569] ============================= [ 27.138122] [ BUG: Invalid wait context ] [ 27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G O [ 27.139542] ----------------------------- [ 27.140106] test_progs/1729 is trying to lock: [ 27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130 [ 27.141834] other info that might help us debug this: [ 27.142437] context-{5:5} [ 27.142856] 2 locks held by test_progs/1729: [ 27.143352] #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40 [ 27.144492] #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0 [ 27.145855] stack backtrace: [ 27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G O 6.5.0-03980-gd11ae1b16b0a #247 [ 27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 27.149127] Call Trace: [ 27.149490] [ 27.149867] dump_stack_lvl+0x130/0x1d0 [ 27.152609] dump_stack+0x14/0x20 [ 27.153131] __lock_acquire+0x1657/0x2220 [ 27.153677] lock_acquire+0x1b8/0x510 [ 27.157908] local_lock_acquire+0x29/0x130 [ 27.159048] obj_cgroup_charge+0xf4/0x3c0 [ 27.160794] slab_pre_alloc_hook+0x28e/0x2b0 [ 27.161931] __kmem_cache_alloc_node+0x51/0x210 [ 27.163557] __kmalloc+0xaa/0x210 [ 27.164593] bpf_map_kzalloc+0xbc/0x170 [ 27.165147] bpf_selem_alloc+0x130/0x510 [ 27.166295] bpf_local_storage_update+0x5aa/0x8e0 [ 27.167042] bpf_fd_sk_storage_update_elem+0xdb/0x1a0 [ 27.169199] bpf_map_update_value+0x415/0x4f0 [ 27.169871] map_update_elem+0x413/0x550 [ 27.170330] __sys_bpf+0x5e9/0x640 [ 27.174065] __x64_sys_bpf+0x80/0x90 [ 27.174568] do_syscall_64+0x48/0xa0 [ 27.175201] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 27.175932] RIP: 0033:0x7effb40e41ad [ 27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8 [ 27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad [ 27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002 [ 27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788 [ 27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000 [ 27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000 [ 27.184958] It complains about acquiring a local_lock while holding a raw_spin_lock. It means it should not allocate memory while holding a raw_spin_lock since it is not safe for RT. raw_spin_lock is needed because bpf_local_storage supports tracing context. In particular for task local storage, it is easy to get a "current" task PTR_TO_BTF_ID in tracing bpf prog. However, task (and cgroup) local storage has already been moved to bpf mem allocator which can be used after raw_spin_lock. The splat is for the sk storage. For sk (and inode) storage, it has not been moved to bpf mem allocator. Using raw_spin_lock or not, kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context. However, the local storage helper requires a verifier accepted sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running a bpf prog in a kzalloc unsafe context and also able to hold a verifier accepted sk pointer) could happen. This patch avoids kzalloc after raw_spin_lock to silent the splat. There is an existing kzalloc before the raw_spin_lock. At that point, a kzalloc is very likely required because a lookup has just been done before. Thus, this patch always does the kzalloc before acquiring the raw_spin_lock and remove the later kzalloc usage after the raw_spin_lock. After this change, it will have a charge and then uncharge during the syscall bpf_map_update_elem() code path. This patch opts for simplicity and not continue the old optimization to save one charge and uncharge. This issue is dated back to the very first commit of bpf_sk_storage which had been refactored multiple times to create task, inode, and cgroup storage. This patch uses a Fixes tag with a more recent commit that should be easier to do backport. Fixes: b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage") Signed-off-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev --- kernel/bpf/bpf_local_storage.c | 47 ++++++++++------------------------ 1 file changed, 14 insertions(+), 33 deletions(-) diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index b5149cfce7d4..37ad47d52dc5 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -553,7 +553,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, void *value, u64 map_flags, gfp_t gfp_flags) { struct bpf_local_storage_data *old_sdata = NULL; - struct bpf_local_storage_elem *selem = NULL; + struct bpf_local_storage_elem *alloc_selem, *selem = NULL; struct bpf_local_storage *local_storage; unsigned long flags; int err; @@ -607,11 +607,12 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, } } - if (gfp_flags == GFP_KERNEL) { - selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags); - if (!selem) - return ERR_PTR(-ENOMEM); - } + /* A lookup has just been done before and concluded a new selem is + * needed. The chance of an unnecessary alloc is unlikely. + */ + alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags); + if (!alloc_selem) + return ERR_PTR(-ENOMEM); raw_spin_lock_irqsave(&local_storage->lock, flags); @@ -623,13 +624,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, * simple. */ err = -EAGAIN; - goto unlock_err; + goto unlock; } old_sdata = bpf_local_storage_lookup(local_storage, smap, false); err = check_flags(old_sdata, map_flags); if (err) - goto unlock_err; + goto unlock; if (old_sdata && (map_flags & BPF_F_LOCK)) { copy_map_value_locked(&smap->map, old_sdata->data, value, @@ -638,23 +639,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, goto unlock; } - if (gfp_flags != GFP_KERNEL) { - /* local_storage->lock is held. Hence, we are sure - * we can unlink and uncharge the old_sdata successfully - * later. Hence, instead of charging the new selem now - * and then uncharge the old selem later (which may cause - * a potential but unnecessary charge failure), avoid taking - * a charge at all here (the "!old_sdata" check) and the - * old_sdata will not be uncharged later during - * bpf_selem_unlink_storage_nolock(). - */ - selem = bpf_selem_alloc(smap, owner, value, !old_sdata, gfp_flags); - if (!selem) { - err = -ENOMEM; - goto unlock_err; - } - } - + alloc_selem = NULL; /* First, link the new selem to the map */ bpf_selem_link_map(smap, selem); @@ -665,20 +650,16 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, if (old_sdata) { bpf_selem_unlink_map(SELEM(old_sdata)); bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata), - false, false); + true, false); } unlock: raw_spin_unlock_irqrestore(&local_storage->lock, flags); - return SDATA(selem); - -unlock_err: - raw_spin_unlock_irqrestore(&local_storage->lock, flags); - if (selem) { + if (alloc_selem) { mem_uncharge(smap, owner, smap->elem_size); - bpf_selem_free(selem, smap, true); + bpf_selem_free(alloc_selem, smap, true); } - return ERR_PTR(err); + return err ? ERR_PTR(err) : SDATA(selem); } static u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache) From 55d49f750b1cb1f177fb1b00ae02cba4613bcfb7 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 1 Sep 2023 16:11:28 -0700 Subject: [PATCH 78/95] bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc The commit c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into bpf_local_storage_unlink_nolock() which then later renamed to bpf_local_storage_destroy(). The commit accidentally passed the "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock() which then stopped the uncharge from happening to the sk->sk_omem_alloc. This missing uncharge only happens when the sk is going away (during __sk_destruct). This patch fixes it by always passing "uncharge_mem = true". It is a noop to the task/inode/cgroup storage because they do not have the map_local_storage_(un)charge enabled in the map_ops. A followup patch will be done in bpf-next to remove the uncharge_mem argument. A selftest is added in the next patch. Fixes: c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse") Signed-off-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev --- kernel/bpf/bpf_local_storage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index 37ad47d52dc5..146824cc9689 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -760,7 +760,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage) * of the loop will set the free_cgroup_storage to true. */ free_storage = bpf_selem_unlink_storage_nolock( - local_storage, selem, false, true); + local_storage, selem, true, true); } raw_spin_unlock_irqrestore(&local_storage->lock, flags); From a96d1cfb2da040bdf692d22022371b249742abb2 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 1 Sep 2023 16:11:29 -0700 Subject: [PATCH 79/95] selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc This patch checks the sk_omem_alloc has been uncharged by bpf_sk_storage during the __sk_destruct. Signed-off-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20230901231129.578493-4-martin.lau@linux.dev --- .../bpf/prog_tests/sk_storage_omem_uncharge.c | 56 +++++++++++++++++ .../selftests/bpf/progs/bpf_tracing_net.h | 1 + .../bpf/progs/sk_storage_omem_uncharge.c | 61 +++++++++++++++++++ 3 files changed, 118 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_storage_omem_uncharge.c create mode 100644 tools/testing/selftests/bpf/progs/sk_storage_omem_uncharge.c diff --git a/tools/testing/selftests/bpf/prog_tests/sk_storage_omem_uncharge.c b/tools/testing/selftests/bpf/prog_tests/sk_storage_omem_uncharge.c new file mode 100644 index 000000000000..f35852d245e3 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/sk_storage_omem_uncharge.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Facebook */ +#include +#include +#include +#include +#include "sk_storage_omem_uncharge.skel.h" + +void test_sk_storage_omem_uncharge(void) +{ + struct sk_storage_omem_uncharge *skel; + int sk_fd = -1, map_fd, err, value; + socklen_t optlen; + + skel = sk_storage_omem_uncharge__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel open_and_load")) + return; + map_fd = bpf_map__fd(skel->maps.sk_storage); + + /* A standalone socket not binding to addr:port, + * so nentns is not needed. + */ + sk_fd = socket(AF_INET6, SOCK_STREAM, 0); + if (!ASSERT_GE(sk_fd, 0, "socket")) + goto done; + + optlen = sizeof(skel->bss->cookie); + err = getsockopt(sk_fd, SOL_SOCKET, SO_COOKIE, &skel->bss->cookie, &optlen); + if (!ASSERT_OK(err, "getsockopt(SO_COOKIE)")) + goto done; + + value = 0; + err = bpf_map_update_elem(map_fd, &sk_fd, &value, 0); + if (!ASSERT_OK(err, "bpf_map_update_elem(value=0)")) + goto done; + + value = 0xdeadbeef; + err = bpf_map_update_elem(map_fd, &sk_fd, &value, 0); + if (!ASSERT_OK(err, "bpf_map_update_elem(value=0xdeadbeef)")) + goto done; + + err = sk_storage_omem_uncharge__attach(skel); + if (!ASSERT_OK(err, "attach")) + goto done; + + close(sk_fd); + sk_fd = -1; + + ASSERT_EQ(skel->bss->cookie_found, 2, "cookie_found"); + ASSERT_EQ(skel->bss->omem, 0, "omem"); + +done: + sk_storage_omem_uncharge__destroy(skel); + if (sk_fd != -1) + close(sk_fd); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index cfed4df490f3..0b793a102791 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -88,6 +88,7 @@ #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr #define sk_flags __sk_common.skc_flags #define sk_reuse __sk_common.skc_reuse +#define sk_cookie __sk_common.skc_cookie #define s6_addr32 in6_u.u6_addr32 diff --git a/tools/testing/selftests/bpf/progs/sk_storage_omem_uncharge.c b/tools/testing/selftests/bpf/progs/sk_storage_omem_uncharge.c new file mode 100644 index 000000000000..3e745793b27a --- /dev/null +++ b/tools/testing/selftests/bpf/progs/sk_storage_omem_uncharge.c @@ -0,0 +1,61 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Facebook */ +#include "vmlinux.h" +#include "bpf_tracing_net.h" +#include +#include +#include + +void *local_storage_ptr = NULL; +void *sk_ptr = NULL; +int cookie_found = 0; +__u64 cookie = 0; +__u32 omem = 0; + +void *bpf_rdonly_cast(void *, __u32) __ksym; + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, int); +} sk_storage SEC(".maps"); + +SEC("fexit/bpf_local_storage_destroy") +int BPF_PROG(bpf_local_storage_destroy, struct bpf_local_storage *local_storage) +{ + struct sock *sk; + + if (local_storage_ptr != local_storage) + return 0; + + sk = bpf_rdonly_cast(sk_ptr, bpf_core_type_id_kernel(struct sock)); + if (sk->sk_cookie.counter != cookie) + return 0; + + cookie_found++; + omem = sk->sk_omem_alloc.counter; + local_storage_ptr = NULL; + + return 0; +} + +SEC("fentry/inet6_sock_destruct") +int BPF_PROG(inet6_sock_destruct, struct sock *sk) +{ + int *value; + + if (!cookie || sk->sk_cookie.counter != cookie) + return 0; + + value = bpf_sk_storage_get(&sk_storage, sk, 0, 0); + if (value && *value == 0xdeadbeef) { + cookie_found++; + sk_ptr = sk; + local_storage_ptr = sk->sk_bpf_storage; + } + + return 0; +} + +char _license[] SEC("license") = "GPL"; From fd94d9dadee58e09b49075240fe83423eb1dcd36 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 5 Sep 2023 23:13:56 +0200 Subject: [PATCH 80/95] netfilter: nftables: exthdr: fix 4-byte stack OOB write If priv->len is a multiple of 4, then dst[len / 4] can write past the destination array which leads to stack corruption. This construct is necessary to clean the remainder of the register in case ->len is NOT a multiple of the register size, so make it conditional just like nft_payload.c does. The bug was added in 4.1 cycle and then copied/inherited when tcp/sctp and ip option support was added. Bug reported by Zero Day Initiative project (ZDI-CAN-21950, ZDI-CAN-21951, ZDI-CAN-21961). Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing") Fixes: 935b7f643018 ("netfilter: nft_exthdr: add TCP option matching") Fixes: 133dc203d77d ("netfilter: nft_exthdr: Support SCTP chunks") Fixes: dbb5281a1f84 ("netfilter: nf_tables: add support for matching IPv4 options") Signed-off-by: Florian Westphal --- net/netfilter/nft_exthdr.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c index a9844eefedeb..3fbaa7bf41f9 100644 --- a/net/netfilter/nft_exthdr.c +++ b/net/netfilter/nft_exthdr.c @@ -35,6 +35,14 @@ static unsigned int optlen(const u8 *opt, unsigned int offset) return opt[offset + 1]; } +static int nft_skb_copy_to_reg(const struct sk_buff *skb, int offset, u32 *dest, unsigned int len) +{ + if (len % NFT_REG32_SIZE) + dest[len / NFT_REG32_SIZE] = 0; + + return skb_copy_bits(skb, offset, dest, len); +} + static void nft_exthdr_ipv6_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) @@ -56,8 +64,7 @@ static void nft_exthdr_ipv6_eval(const struct nft_expr *expr, } offset += priv->offset; - dest[priv->len / NFT_REG32_SIZE] = 0; - if (skb_copy_bits(pkt->skb, offset, dest, priv->len) < 0) + if (nft_skb_copy_to_reg(pkt->skb, offset, dest, priv->len) < 0) goto err; return; err: @@ -153,8 +160,7 @@ static void nft_exthdr_ipv4_eval(const struct nft_expr *expr, } offset += priv->offset; - dest[priv->len / NFT_REG32_SIZE] = 0; - if (skb_copy_bits(pkt->skb, offset, dest, priv->len) < 0) + if (nft_skb_copy_to_reg(pkt->skb, offset, dest, priv->len) < 0) goto err; return; err: @@ -210,7 +216,8 @@ static void nft_exthdr_tcp_eval(const struct nft_expr *expr, if (priv->flags & NFT_EXTHDR_F_PRESENT) { *dest = 1; } else { - dest[priv->len / NFT_REG32_SIZE] = 0; + if (priv->len % NFT_REG32_SIZE) + dest[priv->len / NFT_REG32_SIZE] = 0; memcpy(dest, opt + offset, priv->len); } @@ -388,9 +395,8 @@ static void nft_exthdr_sctp_eval(const struct nft_expr *expr, offset + ntohs(sch->length) > pkt->skb->len) break; - dest[priv->len / NFT_REG32_SIZE] = 0; - if (skb_copy_bits(pkt->skb, offset + priv->offset, - dest, priv->len) < 0) + if (nft_skb_copy_to_reg(pkt->skb, offset + priv->offset, + dest, priv->len) < 0) break; return; } From f4f8a7803119005e87b716874bec07c751efafec Mon Sep 17 00:00:00 2001 From: Wander Lairson Costa Date: Fri, 1 Sep 2023 10:50:20 -0300 Subject: [PATCH 81/95] netfilter: nfnetlink_osf: avoid OOB read The opt_num field is controlled by user mode and is not currently validated inside the kernel. An attacker can take advantage of this to trigger an OOB read and potentially leak information. BUG: KASAN: slab-out-of-bounds in nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 Read of size 2 at addr ffff88804bc64272 by task poc/6431 CPU: 1 PID: 6431 Comm: poc Not tainted 6.0.0-rc4 #1 Call Trace: nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 nf_osf_find+0x186/0x2f0 net/netfilter/nfnetlink_osf.c:281 nft_osf_eval+0x37f/0x590 net/netfilter/nft_osf.c:47 expr_call_ops_eval net/netfilter/nf_tables_core.c:214 nft_do_chain+0x2b0/0x1490 net/netfilter/nf_tables_core.c:264 nft_do_chain_ipv4+0x17c/0x1f0 net/netfilter/nft_chain_filter.c:23 [..] Also add validation to genre, subtype and version fields. Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Reported-by: Lucas Leong Signed-off-by: Wander Lairson Costa Signed-off-by: Florian Westphal --- net/netfilter/nfnetlink_osf.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c index 8f1bfa6ccc2d..50723ba08289 100644 --- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -315,6 +315,14 @@ static int nfnl_osf_add_callback(struct sk_buff *skb, f = nla_data(osf_attrs[OSF_ATTR_FINGER]); + if (f->opt_num > ARRAY_SIZE(f->opt)) + return -EINVAL; + + if (!memchr(f->genre, 0, MAXGENRELEN) || + !memchr(f->subtype, 0, MAXGENRELEN) || + !memchr(f->version, 0, MAXGENRELEN)) + return -EINVAL; + kf = kmalloc(sizeof(struct nf_osf_finger), GFP_KERNEL); if (!kf) return -ENOMEM; From fdc04cc2d5fd0bb9c17f36d0a895cf3e151109e6 Mon Sep 17 00:00:00 2001 From: Phil Sutter Date: Fri, 1 Sep 2023 14:15:16 +0200 Subject: [PATCH 82/95] netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID Add a brief description to the enum's comment. Fixes: 837830a4b439 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute") Signed-off-by: Phil Sutter Signed-off-by: Florian Westphal --- include/uapi/linux/netfilter/nf_tables.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 8466c2a9938f..ca30232b7bc8 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -263,6 +263,7 @@ enum nft_chain_attributes { * @NFTA_RULE_USERDATA: user data (NLA_BINARY, NFT_USERDATA_MAXLEN) * @NFTA_RULE_ID: uniquely identifies a rule in a transaction (NLA_U32) * @NFTA_RULE_POSITION_ID: transaction unique identifier of the previous rule (NLA_U32) + * @NFTA_RULE_CHAIN_ID: add the rule to chain by ID, alternative to @NFTA_RULE_CHAIN (NLA_U32) */ enum nft_rule_attributes { NFTA_RULE_UNSPEC, From 2ee52ae94baabf7ee09cf2a8d854b990dac5d0e4 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Mon, 4 Sep 2023 02:14:36 +0200 Subject: [PATCH 83/95] netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction New elements in this transaction might expired before such transaction ends. Skip sync GC for such elements otherwise commit path might walk over an already released object. Once transaction is finished, async GC will collect such expired element. Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Pablo Neira Ayuso Signed-off-by: Florian Westphal --- net/netfilter/nft_set_rbtree.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index c6435e709231..f250b5399344 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -312,6 +312,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL; struct rb_node *node, *next, *parent, **p, *first = NULL; struct nft_rbtree *priv = nft_set_priv(set); + u8 cur_genmask = nft_genmask_cur(net); u8 genmask = nft_genmask_next(net); int d, err; @@ -357,8 +358,11 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, if (!nft_set_elem_active(&rbe->ext, genmask)) continue; - /* perform garbage collection to avoid bogus overlap reports. */ - if (nft_set_elem_expired(&rbe->ext)) { + /* perform garbage collection to avoid bogus overlap reports + * but skip new elements in this transaction. + */ + if (nft_set_elem_expired(&rbe->ext) && + nft_set_elem_active(&rbe->ext, cur_genmask)) { err = nft_rbtree_gc_elem(set, priv, rbe, genmask); if (err < 0) return err; From 050d91c03b28ca479df13dfb02bcd2c60dd6a878 Mon Sep 17 00:00:00 2001 From: Kyle Zeng Date: Tue, 5 Sep 2023 15:04:09 -0700 Subject: [PATCH 84/95] netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c The missing IP_SET_HASH_WITH_NET0 macro in ip_set_hash_netportnet can lead to the use of wrong `CIDR_POS(c)` for calculating array offsets, which can lead to integer underflow. As a result, it leads to slab out-of-bound access. This patch adds back the IP_SET_HASH_WITH_NET0 macro to ip_set_hash_netportnet to address the issue. Fixes: 886503f34d63 ("netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net") Suggested-by: Jozsef Kadlecsik Signed-off-by: Kyle Zeng Acked-by: Jozsef Kadlecsik Signed-off-by: Florian Westphal --- net/netfilter/ipset/ip_set_hash_netportnet.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c index 005a7ce87217..bf4f91b78e1d 100644 --- a/net/netfilter/ipset/ip_set_hash_netportnet.c +++ b/net/netfilter/ipset/ip_set_hash_netportnet.c @@ -36,6 +36,7 @@ MODULE_ALIAS("ip_set_hash:net,port,net"); #define IP_SET_HASH_WITH_PROTO #define IP_SET_HASH_WITH_NETS #define IPSET_NET_COUNT 2 +#define IP_SET_HASH_WITH_NET0 /* IPv4 variant */ From 9b5ba5c9c5109bf89dc64a3f4734bd125d1ce52e Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 6 Sep 2023 11:42:02 +0200 Subject: [PATCH 85/95] netfilter: nf_tables: Unbreak audit log reset Deliver audit log from __nf_tables_dump_rules(), table dereference at the end of the table list loop might point to the list head, leading to this crash. [ 4137.407349] BUG: unable to handle page fault for address: 00000000001f3c50 [ 4137.407357] #PF: supervisor read access in kernel mode [ 4137.407359] #PF: error_code(0x0000) - not-present page [ 4137.407360] PGD 0 P4D 0 [ 4137.407363] Oops: 0000 [#1] PREEMPT SMP PTI [ 4137.407365] CPU: 4 PID: 500177 Comm: nft Not tainted 6.5.0+ #277 [ 4137.407369] RIP: 0010:string+0x49/0xd0 [ 4137.407374] Code: ff 77 36 45 89 d1 31 f6 49 01 f9 66 45 85 d2 75 19 eb 1e 49 39 f8 76 02 88 07 48 83 c7 01 83 c6 01 48 83 c2 01 4c 39 cf 74 07 <0f> b6 02 84 c0 75 e2 4c 89 c2 e9 58 e5 ff ff 48 c7 c0 0e b2 ff 81 [ 4137.407377] RSP: 0018:ffff8881179737f0 EFLAGS: 00010286 [ 4137.407379] RAX: 00000000001f2c50 RBX: ffff888117973848 RCX: ffff0a00ffffff04 [ 4137.407380] RDX: 00000000001f3c50 RSI: 0000000000000000 RDI: 0000000000000000 [ 4137.407381] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff [ 4137.407383] R10: ffffffffffffffff R11: ffff88813584d200 R12: 0000000000000000 [ 4137.407384] R13: ffffffffa15cf709 R14: 0000000000000000 R15: ffffffffa15cf709 [ 4137.407385] FS: 00007fcfc18bb580(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000 [ 4137.407387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4137.407388] CR2: 00000000001f3c50 CR3: 00000001055b2001 CR4: 00000000001706e0 [ 4137.407390] Call Trace: [ 4137.407392] [ 4137.407393] ? __die+0x1b/0x60 [ 4137.407397] ? page_fault_oops+0x6b/0xa0 [ 4137.407399] ? exc_page_fault+0x60/0x120 [ 4137.407403] ? asm_exc_page_fault+0x22/0x30 [ 4137.407408] ? string+0x49/0xd0 [ 4137.407410] vsnprintf+0x257/0x4f0 [ 4137.407414] kvasprintf+0x3e/0xb0 [ 4137.407417] kasprintf+0x3e/0x50 [ 4137.407419] nf_tables_dump_rules+0x1c0/0x360 [nf_tables] [ 4137.407439] ? __alloc_skb+0xc3/0x170 [ 4137.407442] netlink_dump+0x170/0x330 [ 4137.407447] __netlink_dump_start+0x227/0x300 [ 4137.407449] nf_tables_getrule+0x205/0x390 [nf_tables] Deliver audit log only once at the end of the rule dump+reset for consistency with the set dump+reset. Ensure audit reset access to table under rcu read side lock. The table list iteration holds rcu read lock side, but recent audit code dereferences table object out of the rcu read lock side. Fixes: ea078ae9108e ("netfilter: nf_tables: Audit log rule reset") Fixes: 7e9be1124dbe ("netfilter: nf_tables: Audit log setelem reset") Signed-off-by: Pablo Neira Ayuso Acked-by: Phil Sutter Signed-off-by: Florian Westphal --- net/netfilter/nf_tables_api.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 2c81cee858d6..e429ebba74b3 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3480,6 +3480,10 @@ cont: cont_skip: (*idx)++; } + + if (reset && *idx) + audit_log_rule_reset(table, cb->seq, *idx); + return 0; } @@ -3540,9 +3544,6 @@ static int nf_tables_dump_rules(struct sk_buff *skb, done: rcu_read_unlock(); - if (reset && idx > cb->args[0]) - audit_log_rule_reset(table, cb->seq, idx - cb->args[0]); - cb->args[0] = idx; return skb->len; } @@ -5760,8 +5761,6 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb) if (!args.iter.err && args.iter.count == cb->args[0]) args.iter.err = nft_set_catchall_dump(net, skb, set, reset, cb->seq); - rcu_read_unlock(); - nla_nest_end(skb, nest); nlmsg_end(skb, nlh); @@ -5769,6 +5768,8 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb) audit_log_nft_set_reset(table, cb->seq, args.iter.count - args.iter.skip); + rcu_read_unlock(); + if (args.iter.err && args.iter.err != -EMSGSIZE) return args.iter.err; if (args.iter.count == cb->args[0]) From 08c6d8bae48c2c28f7017d7b61b5d5a1518ceb39 Mon Sep 17 00:00:00 2001 From: Lukasz Majewski Date: Tue, 5 Sep 2023 11:33:15 +0200 Subject: [PATCH 86/95] net: phy: Provide Module 4 KSZ9477 errata (DS80000754C) The KSZ9477 errata points out (in 'Module 4') the link up/down problems when EEE (Energy Efficient Ethernet) is enabled in the device to which the KSZ9477 tries to auto negotiate. The suggested workaround is to clear advertisement of EEE for PHYs in this chip driver. To avoid regressions with other switch ICs the new MICREL_NO_EEE flag has been introduced. Moreover, the in-register disablement of MMD_DEVICE_ID_EEE_ADV.MMD_EEE_ADV MMD register is removed, as this code is both; now executed too late (after previous rework of the PHY and DSA for KSZ switches) and not required as setting all members of eee_broken_modes bit field prevents the KSZ9477 from advertising EEE. Fixes: 69d3b36ca045 ("net: dsa: microchip: enable EEE support") # for KSZ9477 Signed-off-by: Lukasz Majewski Tested-by: Oleksij Rempel # Confirmed disabled EEE with oscilloscope. Reviewed-by: Oleksij Rempel Reviewed-by: Florian Fainelli Link: https://lore.kernel.org/r/20230905093315.784052-1-lukma@denx.de Signed-off-by: Jakub Kicinski --- drivers/net/dsa/microchip/ksz_common.c | 16 +++++++++++++++- drivers/net/phy/micrel.c | 9 ++++++--- include/linux/micrel_phy.h | 1 + 3 files changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 6673122266b7..42db7679c360 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -2335,13 +2335,27 @@ static u32 ksz_get_phy_flags(struct dsa_switch *ds, int port) { struct ksz_device *dev = ds->priv; - if (dev->chip_id == KSZ8830_CHIP_ID) { + switch (dev->chip_id) { + case KSZ8830_CHIP_ID: /* Silicon Errata Sheet (DS80000830A): * Port 1 does not work with LinkMD Cable-Testing. * Port 1 does not respond to received PAUSE control frames. */ if (!port) return MICREL_KSZ8_P1_ERRATA; + break; + case KSZ9477_CHIP_ID: + /* KSZ9477 Errata DS80000754C + * + * Module 4: Energy Efficient Ethernet (EEE) feature select must + * be manually disabled + * The EEE feature is enabled by default, but it is not fully + * operational. It must be manually disabled through register + * controls. If not disabled, the PHY ports can auto-negotiate + * to enable EEE, and this feature can cause link drops when + * linked to another device supporting EEE. + */ + return MICREL_NO_EEE; } return 0; diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c index b6d7981b2d1e..927d3d54658e 100644 --- a/drivers/net/phy/micrel.c +++ b/drivers/net/phy/micrel.c @@ -1800,9 +1800,6 @@ static const struct ksz9477_errata_write ksz9477_errata_writes[] = { /* Transmit waveform amplitude can be improved (1000BASE-T, 100BASE-TX, 10BASE-Te) */ {0x1c, 0x04, 0x00d0}, - /* Energy Efficient Ethernet (EEE) feature select must be manually disabled */ - {0x07, 0x3c, 0x0000}, - /* Register settings are required to meet data sheet supply current specifications */ {0x1c, 0x13, 0x6eff}, {0x1c, 0x14, 0xe6ff}, @@ -1847,6 +1844,12 @@ static int ksz9477_config_init(struct phy_device *phydev) return err; } + /* According to KSZ9477 Errata DS80000754C (Module 4) all EEE modes + * in this switch shall be regarded as broken. + */ + if (phydev->dev_flags & MICREL_NO_EEE) + phydev->eee_broken_modes = -1; + err = genphy_restart_aneg(phydev); if (err) return err; diff --git a/include/linux/micrel_phy.h b/include/linux/micrel_phy.h index 322d87255984..4e27ca7c49de 100644 --- a/include/linux/micrel_phy.h +++ b/include/linux/micrel_phy.h @@ -44,6 +44,7 @@ #define MICREL_PHY_50MHZ_CLK BIT(0) #define MICREL_PHY_FXEN BIT(1) #define MICREL_KSZ8_P1_ERRATA BIT(2) +#define MICREL_NO_EEE BIT(3) #define MICREL_KSZ9021_EXTREG_CTRL 0xB #define MICREL_KSZ9021_EXTREG_DATA_WRITE 0xC From 61a1deacc3d4fd3d57d7fda4d935f7f7503e8440 Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Wed, 6 Sep 2023 15:20:12 +0800 Subject: [PATCH 87/95] net: hns3: fix tx timeout issue Currently, the driver knocks the ring doorbell before updating the ring->last_to_use in tx flow. if the hardware transmiting packet and napi poll scheduling are fast enough, it may get the old ring->last_to_use in drivers' napi poll. In this case, the driver will think the tx is not completed, and return directly without clear the flag __QUEUE_STATE_STACK_XOFF, which may cause tx timeout. Fixes: 20d06ca2679c ("net: hns3: optimize the tx clean process") Signed-off-by: Jian Shen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index eac2d0573241..81947c4e5100 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2103,8 +2103,12 @@ static void hns3_tx_doorbell(struct hns3_enet_ring *ring, int num, */ if (test_bit(HNS3_NIC_STATE_TX_PUSH_ENABLE, &priv->state) && num && !ring->pending_buf && num <= HNS3_MAX_PUSH_BD_NUM && doorbell) { + /* This smp_store_release() pairs with smp_load_aquire() in + * hns3_nic_reclaim_desc(). Ensure that the BD valid bit + * is updated. + */ + smp_store_release(&ring->last_to_use, ring->next_to_use); hns3_tx_push_bd(ring, num); - WRITE_ONCE(ring->last_to_use, ring->next_to_use); return; } @@ -2115,6 +2119,11 @@ static void hns3_tx_doorbell(struct hns3_enet_ring *ring, int num, return; } + /* This smp_store_release() pairs with smp_load_aquire() in + * hns3_nic_reclaim_desc(). Ensure that the BD valid bit is updated. + */ + smp_store_release(&ring->last_to_use, ring->next_to_use); + if (ring->tqp->mem_base) hns3_tx_mem_doorbell(ring); else @@ -2122,7 +2131,6 @@ static void hns3_tx_doorbell(struct hns3_enet_ring *ring, int num, ring->tqp->io_base + HNS3_RING_TX_RING_TAIL_REG); ring->pending_buf = 0; - WRITE_ONCE(ring->last_to_use, ring->next_to_use); } static void hns3_tsyn(struct net_device *netdev, struct sk_buff *skb, @@ -3563,9 +3571,8 @@ static void hns3_reuse_buffer(struct hns3_enet_ring *ring, int i) static bool hns3_nic_reclaim_desc(struct hns3_enet_ring *ring, int *bytes, int *pkts, int budget) { - /* pair with ring->last_to_use update in hns3_tx_doorbell(), - * smp_store_release() is not used in hns3_tx_doorbell() because - * the doorbell operation already have the needed barrier operation. + /* This smp_load_acquire() pairs with smp_store_release() in + * hns3_tx_doorbell(). */ int ltu = smp_load_acquire(&ring->last_to_use); int ntc = ring->next_to_clean; From dd2bbc2ef69a920d93801321b0b01ac6c4e5cacd Mon Sep 17 00:00:00 2001 From: Jijie Shao Date: Wed, 6 Sep 2023 15:20:13 +0800 Subject: [PATCH 88/95] net: hns3: Support query tx timeout threshold by debugfs support query tx timeout threshold by debugfs Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index f276b5ecb431..8086722a56c0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -1045,6 +1045,7 @@ hns3_dbg_dev_specs(struct hnae3_handle *h, char *buf, int len, int *pos) struct hnae3_ae_dev *ae_dev = pci_get_drvdata(h->pdev); struct hnae3_dev_specs *dev_specs = &ae_dev->dev_specs; struct hnae3_knic_private_info *kinfo = &h->kinfo; + struct net_device *dev = kinfo->netdev; *pos += scnprintf(buf + *pos, len - *pos, "dev_spec:\n"); *pos += scnprintf(buf + *pos, len - *pos, "MAC entry num: %u\n", @@ -1087,6 +1088,9 @@ hns3_dbg_dev_specs(struct hnae3_handle *h, char *buf, int len, int *pos) dev_specs->mc_mac_size); *pos += scnprintf(buf + *pos, len - *pos, "MAC statistics number: %u\n", dev_specs->mac_stats_num); + *pos += scnprintf(buf + *pos, len - *pos, + "TX timeout threshold: %d seconds\n", + dev->watchdog_timeo / HZ); } static int hns3_dbg_dev_info(struct hnae3_handle *h, char *buf, int len) From efccf655e99b6907ca07a466924e91805892e7d3 Mon Sep 17 00:00:00 2001 From: Hao Chen Date: Wed, 6 Sep 2023 15:20:14 +0800 Subject: [PATCH 89/95] net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read() req1->tcam_data is defined as "u8 tcam_data[8]", and we convert it as (u32 *) without considerring byte order conversion, it may result in printing wrong data for tcam_data. Convert tcam_data to (__le32 *) first to fix it. Fixes: b5a0b70d77b9 ("net: hns3: refactor dump fd tcam of debugfs") Signed-off-by: Hao Chen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index f01a7a9ee02c..ff3f8f424ad9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -1519,7 +1519,7 @@ static int hclge_dbg_fd_tcam_read(struct hclge_dev *hdev, bool sel_x, struct hclge_desc desc[3]; int pos = 0; int ret, i; - u32 *req; + __le32 *req; hclge_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_FD_TCAM_OP, true); desc[0].flag |= cpu_to_le16(HCLGE_COMM_CMD_FLAG_NEXT); @@ -1544,22 +1544,22 @@ static int hclge_dbg_fd_tcam_read(struct hclge_dev *hdev, bool sel_x, tcam_msg.loc); /* tcam_data0 ~ tcam_data1 */ - req = (u32 *)req1->tcam_data; + req = (__le32 *)req1->tcam_data; for (i = 0; i < 2; i++) pos += scnprintf(tcam_buf + pos, HCLGE_DBG_TCAM_BUF_SIZE - pos, - "%08x\n", *req++); + "%08x\n", le32_to_cpu(*req++)); /* tcam_data2 ~ tcam_data7 */ - req = (u32 *)req2->tcam_data; + req = (__le32 *)req2->tcam_data; for (i = 0; i < 6; i++) pos += scnprintf(tcam_buf + pos, HCLGE_DBG_TCAM_BUF_SIZE - pos, - "%08x\n", *req++); + "%08x\n", le32_to_cpu(*req++)); /* tcam_data8 ~ tcam_data12 */ - req = (u32 *)req3->tcam_data; + req = (__le32 *)req3->tcam_data; for (i = 0; i < 5; i++) pos += scnprintf(tcam_buf + pos, HCLGE_DBG_TCAM_BUF_SIZE - pos, - "%08x\n", *req++); + "%08x\n", le32_to_cpu(*req++)); return ret; } From c295160b1d95e885f1af4586a221cb221d232d10 Mon Sep 17 00:00:00 2001 From: Hao Chen Date: Wed, 6 Sep 2023 15:20:15 +0800 Subject: [PATCH 90/95] net: hns3: fix debugfs concurrency issue between kfree buffer and read Now in hns3_dbg_uninit(), there may be concurrency between kfree buffer and read, it may result in memory error. Moving debugfs_remove_recursive() in front of kfree buffer to ensure they don't happen at the same time. Fixes: 5e69ea7ee2a6 ("net: hns3: refactor the debugfs process") Signed-off-by: Hao Chen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 8086722a56c0..b8508533878b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -1415,9 +1415,9 @@ int hns3_dbg_init(struct hnae3_handle *handle) return 0; out: - mutex_destroy(&handle->dbgfs_lock); debugfs_remove_recursive(handle->hnae3_dbgfs); handle->hnae3_dbgfs = NULL; + mutex_destroy(&handle->dbgfs_lock); return ret; } @@ -1425,6 +1425,9 @@ void hns3_dbg_uninit(struct hnae3_handle *handle) { u32 i; + debugfs_remove_recursive(handle->hnae3_dbgfs); + handle->hnae3_dbgfs = NULL; + for (i = 0; i < ARRAY_SIZE(hns3_dbg_cmd); i++) if (handle->dbgfs_buf[i]) { kvfree(handle->dbgfs_buf[i]); @@ -1432,8 +1435,6 @@ void hns3_dbg_uninit(struct hnae3_handle *handle) } mutex_destroy(&handle->dbgfs_lock); - debugfs_remove_recursive(handle->hnae3_dbgfs); - handle->hnae3_dbgfs = NULL; } void hns3_dbg_register_debugfs(const char *debugfs_dir_name) From fa5564945f7d15ae2390b00c08b6abaef0165cda Mon Sep 17 00:00:00 2001 From: Jijie Shao Date: Wed, 6 Sep 2023 15:20:16 +0800 Subject: [PATCH 91/95] net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue We hope that tc qdisc and dcb ets commands can not be used crosswise. If we want to use any of the commands to configure tc, We must use the other command to clear the existing configuration. However, when we configure a single tc with tc qdisc, we can still configure it with dcb ets. Because we use mqprio_active as the tag of tc qdisc configuration, but with dcb ets, we do not check mqprio_active. This patch fix this issue by check mqprio_active before executing the dcb ets command. and add dcb_ets_active to replace HCLGE_FLAG_DCB_ENABLE and HCLGE_FLAG_MQPRIO_ENABLE at the hclge layer, Fixes: cacde272dd00 ("net: hns3: Add hclge_dcb module for the support of DCB feature") Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../hisilicon/hns3/hns3pf/hclge_dcb.c | 20 +++++-------------- .../hisilicon/hns3/hns3pf/hclge_main.c | 5 +++-- .../hisilicon/hns3/hns3pf/hclge_main.h | 2 -- 4 files changed, 9 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index a4b43bcd2f0c..aaf1f42624a7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -814,6 +814,7 @@ struct hnae3_tc_info { u8 max_tc; /* Total number of TCs */ u8 num_tc; /* Total number of enabled TCs */ bool mqprio_active; + bool dcb_ets_active; }; #define HNAE3_MAX_DSCP 64 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index fad5a5ff3cda..b98301e205f7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -259,7 +259,7 @@ static int hclge_ieee_setets(struct hnae3_handle *h, struct ieee_ets *ets) int ret; if (!(hdev->dcbx_cap & DCB_CAP_DCBX_VER_IEEE) || - hdev->flag & HCLGE_FLAG_MQPRIO_ENABLE) + h->kinfo.tc_info.mqprio_active) return -EINVAL; ret = hclge_ets_validate(hdev, ets, &num_tc, &map_changed); @@ -275,10 +275,7 @@ static int hclge_ieee_setets(struct hnae3_handle *h, struct ieee_ets *ets) } hclge_tm_schd_info_update(hdev, num_tc); - if (num_tc > 1) - hdev->flag |= HCLGE_FLAG_DCB_ENABLE; - else - hdev->flag &= ~HCLGE_FLAG_DCB_ENABLE; + h->kinfo.tc_info.dcb_ets_active = num_tc > 1; ret = hclge_ieee_ets_to_tm_info(hdev, ets); if (ret) @@ -487,7 +484,7 @@ static u8 hclge_getdcbx(struct hnae3_handle *h) struct hclge_vport *vport = hclge_get_vport(h); struct hclge_dev *hdev = vport->back; - if (hdev->flag & HCLGE_FLAG_MQPRIO_ENABLE) + if (h->kinfo.tc_info.mqprio_active) return 0; return hdev->dcbx_cap; @@ -611,7 +608,8 @@ static int hclge_setup_tc(struct hnae3_handle *h, if (!test_bit(HCLGE_STATE_NIC_REGISTERED, &hdev->state)) return -EBUSY; - if (hdev->flag & HCLGE_FLAG_DCB_ENABLE) + kinfo = &vport->nic.kinfo; + if (kinfo->tc_info.dcb_ets_active) return -EINVAL; ret = hclge_mqprio_qopt_check(hdev, mqprio_qopt); @@ -625,7 +623,6 @@ static int hclge_setup_tc(struct hnae3_handle *h, if (ret) return ret; - kinfo = &vport->nic.kinfo; memcpy(&old_tc_info, &kinfo->tc_info, sizeof(old_tc_info)); hclge_sync_mqprio_qopt(&kinfo->tc_info, mqprio_qopt); kinfo->tc_info.mqprio_active = tc > 0; @@ -634,13 +631,6 @@ static int hclge_setup_tc(struct hnae3_handle *h, if (ret) goto err_out; - hdev->flag &= ~HCLGE_FLAG_DCB_ENABLE; - - if (tc > 1) - hdev->flag |= HCLGE_FLAG_MQPRIO_ENABLE; - else - hdev->flag &= ~HCLGE_FLAG_MQPRIO_ENABLE; - return hclge_notify_init_up(hdev); err_out: diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 0f50dba6cc47..8ca368424436 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -11026,6 +11026,7 @@ static void hclge_get_mdix_mode(struct hnae3_handle *handle, static void hclge_info_show(struct hclge_dev *hdev) { + struct hnae3_handle *handle = &hdev->vport->nic; struct device *dev = &hdev->pdev->dev; dev_info(dev, "PF info begin:\n"); @@ -11042,9 +11043,9 @@ static void hclge_info_show(struct hclge_dev *hdev) dev_info(dev, "This is %s PF\n", hdev->flag & HCLGE_FLAG_MAIN ? "main" : "not main"); dev_info(dev, "DCB %s\n", - hdev->flag & HCLGE_FLAG_DCB_ENABLE ? "enable" : "disable"); + handle->kinfo.tc_info.dcb_ets_active ? "enable" : "disable"); dev_info(dev, "MQPRIO %s\n", - hdev->flag & HCLGE_FLAG_MQPRIO_ENABLE ? "enable" : "disable"); + handle->kinfo.tc_info.mqprio_active ? "enable" : "disable"); dev_info(dev, "Default tx spare buffer size: %u\n", hdev->tx_spare_buf_size); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index ec233ec57222..7bc2049b723d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -919,8 +919,6 @@ struct hclge_dev { #define HCLGE_FLAG_MAIN BIT(0) #define HCLGE_FLAG_DCB_CAPABLE BIT(1) -#define HCLGE_FLAG_DCB_ENABLE BIT(2) -#define HCLGE_FLAG_MQPRIO_ENABLE BIT(3) u32 flag; u32 pkt_buf_size; /* Total pf buf size for tx/rx */ From 674d9591a32d01df75d6b5fffed4ef942a294376 Mon Sep 17 00:00:00 2001 From: Yisen Zhuang Date: Wed, 6 Sep 2023 15:20:17 +0800 Subject: [PATCH 92/95] net: hns3: fix the port information display when sfp is absent When sfp is absent or unidentified, the port type should be displayed as PORT_OTHERS, rather than PORT_FIBRE. Fixes: 88d10bd6f730 ("net: hns3: add support for multiple media type") Signed-off-by: Yisen Zhuang Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 36858a72d771..682239f33082 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -773,7 +773,9 @@ static int hns3_get_link_ksettings(struct net_device *netdev, hns3_get_ksettings(h, cmd); break; case HNAE3_MEDIA_TYPE_FIBER: - if (module_type == HNAE3_MODULE_TYPE_CR) + if (module_type == HNAE3_MODULE_TYPE_UNKNOWN) + cmd->base.port = PORT_OTHER; + else if (module_type == HNAE3_MODULE_TYPE_CR) cmd->base.port = PORT_DA; else cmd->base.port = PORT_FIBRE; From 60326634f6c54528778de18bfef1e8a7a93b3771 Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Wed, 6 Sep 2023 15:20:18 +0800 Subject: [PATCH 93/95] net: hns3: remove GSO partial feature bit HNS3 NIC does not support GSO partial packets segmentation. Actually tunnel packets for example NvGRE packets segment offload and checksum offload is already supported. There is no need to keep gso partial feature bit. So this patch removes it. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Jie Wang Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 81947c4e5100..b4895c7b3efd 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3316,8 +3316,6 @@ static void hns3_set_default_feature(struct net_device *netdev) netdev->priv_flags |= IFF_UNICAST_FLT; - netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM; - netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_GSO | From 6afcf0fb92701487421aa73c692855aa70fbc796 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 7 Sep 2023 11:01:04 -0700 Subject: [PATCH 94/95] Revert "net: team: do not use dynamic lockdep key" This reverts commit 39285e124edbc752331e98ace37cc141a6a3747a. Looks like the change has unintended consequences in exposing objects before they are initialized. Let's drop this patch and try again in net-next. Reported-by: syzbot+44ae022028805f4600fc@syzkaller.appspotmail.com Fixes: 39285e124edb ("net: team: do not use dynamic lockdep key") Link: https://lore.kernel.org/all/20230907103124.6adb7256@kernel.org/ Signed-off-by: Jakub Kicinski --- drivers/net/team/team.c | 113 ++++++++++++----------- drivers/net/team/team_mode_loadbalance.c | 4 +- include/linux/if_team.h | 30 +----- 3 files changed, 61 insertions(+), 86 deletions(-) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index ad29122a5468..e8b94580194e 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -1135,8 +1135,8 @@ static int team_port_add(struct team *team, struct net_device *port_dev, struct netlink_ext_ack *extack) { struct net_device *dev = team->dev; - char *portname = port_dev->name; struct team_port *port; + char *portname = port_dev->name; int err; if (port_dev->flags & IFF_LOOPBACK) { @@ -1203,26 +1203,6 @@ static int team_port_add(struct team *team, struct net_device *port_dev, memcpy(port->orig.dev_addr, port_dev->dev_addr, port_dev->addr_len); - err = dev_open(port_dev, extack); - if (err) { - netdev_dbg(dev, "Device %s opening failed\n", - portname); - goto err_dev_open; - } - - err = team_upper_dev_link(team, port, extack); - if (err) { - netdev_err(dev, "Device %s failed to set upper link\n", - portname); - goto err_set_upper_link; - } - - /* lockdep subclass variable(dev->nested_level) was updated by - * team_upper_dev_link(). - */ - team_unlock(team); - team_lock(team); - err = team_port_enter(team, port); if (err) { netdev_err(dev, "Device %s failed to enter team mode\n", @@ -1230,6 +1210,13 @@ static int team_port_add(struct team *team, struct net_device *port_dev, goto err_port_enter; } + err = dev_open(port_dev, extack); + if (err) { + netdev_dbg(dev, "Device %s opening failed\n", + portname); + goto err_dev_open; + } + err = vlan_vids_add_by_dev(port_dev, dev); if (err) { netdev_err(dev, "Failed to add vlan ids to device %s\n", @@ -1255,6 +1242,13 @@ static int team_port_add(struct team *team, struct net_device *port_dev, goto err_handler_register; } + err = team_upper_dev_link(team, port, extack); + if (err) { + netdev_err(dev, "Device %s failed to set upper link\n", + portname); + goto err_set_upper_link; + } + err = __team_option_inst_add_port(team, port); if (err) { netdev_err(dev, "Device %s failed to add per-port options\n", @@ -1301,6 +1295,9 @@ err_set_slave_promisc: __team_option_inst_del_port(team, port); err_option_port_add: + team_upper_dev_unlink(team, port); + +err_set_upper_link: netdev_rx_handler_unregister(port_dev); err_handler_register: @@ -1310,16 +1307,13 @@ err_enable_netpoll: vlan_vids_del_by_dev(port_dev, dev); err_vids_add: - team_port_leave(team, port); - -err_port_enter: - team_upper_dev_unlink(team, port); - -err_set_upper_link: dev_close(port_dev); err_dev_open: + team_port_leave(team, port); team_port_set_orig_dev_addr(port); + +err_port_enter: dev_set_mtu(port_dev, port->orig.mtu); err_set_mtu: @@ -1622,7 +1616,6 @@ static int team_init(struct net_device *dev) int err; team->dev = dev; - mutex_init(&team->lock); team_set_no_mode(team); team->notifier_ctx = false; @@ -1650,6 +1643,8 @@ static int team_init(struct net_device *dev) goto err_options_register; netif_carrier_off(dev); + lockdep_register_key(&team->team_lock_key); + __mutex_init(&team->lock, "team->team_lock_key", &team->team_lock_key); netdev_lockdep_set_classes(dev); return 0; @@ -1670,7 +1665,7 @@ static void team_uninit(struct net_device *dev) struct team_port *port; struct team_port *tmp; - team_lock(team); + mutex_lock(&team->lock); list_for_each_entry_safe(port, tmp, &team->port_list, list) team_port_del(team, port->dev); @@ -1679,8 +1674,9 @@ static void team_uninit(struct net_device *dev) team_mcast_rejoin_fini(team); team_notify_peers_fini(team); team_queue_override_fini(team); - team_unlock(team); + mutex_unlock(&team->lock); netdev_change_features(dev); + lockdep_unregister_key(&team->team_lock_key); } static void team_destructor(struct net_device *dev) @@ -1794,18 +1790,18 @@ static void team_set_rx_mode(struct net_device *dev) static int team_set_mac_address(struct net_device *dev, void *p) { - struct team *team = netdev_priv(dev); struct sockaddr *addr = p; + struct team *team = netdev_priv(dev); struct team_port *port; if (dev->type == ARPHRD_ETHER && !is_valid_ether_addr(addr->sa_data)) return -EADDRNOTAVAIL; dev_addr_set(dev, addr->sa_data); - team_lock(team); + mutex_lock(&team->lock); list_for_each_entry(port, &team->port_list, list) if (team->ops.port_change_dev_addr) team->ops.port_change_dev_addr(team, port); - team_unlock(team); + mutex_unlock(&team->lock); return 0; } @@ -1819,7 +1815,7 @@ static int team_change_mtu(struct net_device *dev, int new_mtu) * Alhough this is reader, it's guarded by team lock. It's not possible * to traverse list in reverse under rcu_read_lock */ - team_lock(team); + mutex_lock(&team->lock); team->port_mtu_change_allowed = true; list_for_each_entry(port, &team->port_list, list) { err = dev_set_mtu(port->dev, new_mtu); @@ -1830,7 +1826,7 @@ static int team_change_mtu(struct net_device *dev, int new_mtu) } } team->port_mtu_change_allowed = false; - team_unlock(team); + mutex_unlock(&team->lock); dev->mtu = new_mtu; @@ -1840,7 +1836,7 @@ unwind: list_for_each_entry_continue_reverse(port, &team->port_list, list) dev_set_mtu(port->dev, dev->mtu); team->port_mtu_change_allowed = false; - team_unlock(team); + mutex_unlock(&team->lock); return err; } @@ -1894,20 +1890,20 @@ static int team_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) * Alhough this is reader, it's guarded by team lock. It's not possible * to traverse list in reverse under rcu_read_lock */ - team_lock(team); + mutex_lock(&team->lock); list_for_each_entry(port, &team->port_list, list) { err = vlan_vid_add(port->dev, proto, vid); if (err) goto unwind; } - team_unlock(team); + mutex_unlock(&team->lock); return 0; unwind: list_for_each_entry_continue_reverse(port, &team->port_list, list) vlan_vid_del(port->dev, proto, vid); - team_unlock(team); + mutex_unlock(&team->lock); return err; } @@ -1917,10 +1913,10 @@ static int team_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, u16 vid) struct team *team = netdev_priv(dev); struct team_port *port; - team_lock(team); + mutex_lock(&team->lock); list_for_each_entry(port, &team->port_list, list) vlan_vid_del(port->dev, proto, vid); - team_unlock(team); + mutex_unlock(&team->lock); return 0; } @@ -1942,9 +1938,9 @@ static void team_netpoll_cleanup(struct net_device *dev) { struct team *team = netdev_priv(dev); - team_lock(team); + mutex_lock(&team->lock); __team_netpoll_cleanup(team); - team_unlock(team); + mutex_unlock(&team->lock); } static int team_netpoll_setup(struct net_device *dev, @@ -1954,7 +1950,7 @@ static int team_netpoll_setup(struct net_device *dev, struct team_port *port; int err = 0; - team_lock(team); + mutex_lock(&team->lock); list_for_each_entry(port, &team->port_list, list) { err = __team_port_enable_netpoll(port); if (err) { @@ -1962,7 +1958,7 @@ static int team_netpoll_setup(struct net_device *dev, break; } } - team_unlock(team); + mutex_unlock(&team->lock); return err; } #endif @@ -1973,9 +1969,9 @@ static int team_add_slave(struct net_device *dev, struct net_device *port_dev, struct team *team = netdev_priv(dev); int err; - team_lock(team); + mutex_lock(&team->lock); err = team_port_add(team, port_dev, extack); - team_unlock(team); + mutex_unlock(&team->lock); if (!err) netdev_change_features(dev); @@ -1988,12 +1984,19 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev) struct team *team = netdev_priv(dev); int err; - team_lock(team); + mutex_lock(&team->lock); err = team_port_del(team, port_dev); - team_unlock(team); + mutex_unlock(&team->lock); - if (!err) - netdev_change_features(dev); + if (err) + return err; + + if (netif_is_team_master(port_dev)) { + lockdep_unregister_key(&team->team_lock_key); + lockdep_register_key(&team->team_lock_key); + lockdep_set_class(&team->lock, &team->team_lock_key); + } + netdev_change_features(dev); return err; } @@ -2313,13 +2316,13 @@ static struct team *team_nl_team_get(struct genl_info *info) } team = netdev_priv(dev); - __team_lock(team); + mutex_lock(&team->lock); return team; } static void team_nl_team_put(struct team *team) { - team_unlock(team); + mutex_unlock(&team->lock); dev_put(team->dev); } @@ -2981,9 +2984,9 @@ static void team_port_change_check(struct team_port *port, bool linkup) { struct team *team = port->team; - team_lock(team); + mutex_lock(&team->lock); __team_port_change_check(port, linkup); - team_unlock(team); + mutex_unlock(&team->lock); } diff --git a/drivers/net/team/team_mode_loadbalance.c b/drivers/net/team/team_mode_loadbalance.c index 7bcc9d37447a..00f8989c29c0 100644 --- a/drivers/net/team/team_mode_loadbalance.c +++ b/drivers/net/team/team_mode_loadbalance.c @@ -478,7 +478,7 @@ static void lb_stats_refresh(struct work_struct *work) team = lb_priv_ex->team; lb_priv = get_lb_priv(team); - if (!team_trylock(team)) { + if (!mutex_trylock(&team->lock)) { schedule_delayed_work(&lb_priv_ex->stats.refresh_dw, 0); return; } @@ -515,7 +515,7 @@ static void lb_stats_refresh(struct work_struct *work) schedule_delayed_work(&lb_priv_ex->stats.refresh_dw, (lb_priv_ex->stats.refresh_interval * HZ) / 10); - team_unlock(team); + mutex_unlock(&team->lock); } static void lb_stats_refresh_interval_get(struct team *team, diff --git a/include/linux/if_team.h b/include/linux/if_team.h index 12d4447fc8ab..1b9b15a492fa 100644 --- a/include/linux/if_team.h +++ b/include/linux/if_team.h @@ -221,38 +221,10 @@ struct team { atomic_t count_pending; struct delayed_work dw; } mcast_rejoin; + struct lock_class_key team_lock_key; long mode_priv[TEAM_MODE_PRIV_LONGS]; }; -static inline void __team_lock(struct team *team) -{ - mutex_lock(&team->lock); -} - -static inline int team_trylock(struct team *team) -{ - return mutex_trylock(&team->lock); -} - -#ifdef CONFIG_LOCKDEP -static inline void team_lock(struct team *team) -{ - ASSERT_RTNL(); - mutex_lock_nested(&team->lock, team->dev->nested_level); -} - -#else -static inline void team_lock(struct team *team) -{ - __team_lock(team); -} -#endif - -static inline void team_unlock(struct team *team) -{ - mutex_unlock(&team->lock); -} - static inline int team_dev_queue_xmit(struct team *team, struct team_port *port, struct sk_buff *skb) { From 1b36955cc048c8ff6ba448dbf4be0e52f59f2963 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Wed, 6 Sep 2023 17:16:09 +0300 Subject: [PATCH 95/95] net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs() enetc_psi_create() returns an ERR_PTR() or a valid station interface pointer, but checking for the non-NULL quality of the return code blurs that difference away. So if enetc_psi_create() fails, we call enetc_psi_destroy() when we shouldn't. This will likely result in crashes, since enetc_psi_create() cleans up everything after itself when it returns an ERR_PTR(). Fixes: f0168042a212 ("net: enetc: reimplement RFS/RSS memory clearing as PCI quirk") Reported-by: Dan Carpenter Closes: https://lore.kernel.org/netdev/582183ef-e03b-402b-8e2d-6d9bb3c83bd9@moroto.mountain/ Suggested-by: Dan Carpenter Signed-off-by: Vladimir Oltean Reviewed-by: Simon Horman Link: https://lore.kernel.org/r/20230906141609.247579-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/enetc/enetc_pf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c index e0a4cb7e3f50..c153dc083aff 100644 --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c @@ -1402,7 +1402,7 @@ static void enetc_fixup_clear_rss_rfs(struct pci_dev *pdev) return; si = enetc_psi_create(pdev); - if (si) + if (!IS_ERR(si)) enetc_psi_destroy(pdev); } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_FREESCALE, ENETC_DEV_ID_PF,