IPv6 flowlabels historically require a reservation before use,
optionally in exclusive mode (e.g., user-private).
Commit 59c820b231 ("ipv6: elide flowlabel check if no exclusive
leases exist") introduced a fast path that avoids this check when no
exclusive leases exist in the system, in which case any flowlabel use
is granted. That allows skipping the control operation to reserve a
flowlabel entirely, though with a caveat if the fast path fails:
This is an optimization. Robust applications still have to revert to
requesting leases if the fast path fails due to an exclusive lease.
Still, this is subtle. Better to isolate network namespaces from each
other. Flowlabels are per-netns, so also record per-netns whether
exclusive leases are in use. Then behavior does not change based on
activity in other netns.
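For context, the reservation that the fast path lets applications skip
goes through the IPV6_FLOWLABEL_MGR socket option; a minimal userspace
sketch of requesting an exclusive lease (error handling and the actual
use of the label on traffic omitted) might look like:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/in6.h>

/* Sketch: reserve 'label' (20-bit, network byte order) exclusively
 * before sending traffic with it. */
static int request_exclusive_flowlabel(int fd, const struct in6_addr *dst,
                                       __be32 label)
{
    struct in6_flowlabel_req req;

    memset(&req, 0, sizeof(req));
    req.flr_dst = *dst;
    req.flr_label = label;
    req.flr_action = IPV6_FL_A_GET;
    req.flr_share = IPV6_FL_S_EXCL;                 /* exclusive lease */
    req.flr_flags = IPV6_FL_F_CREATE | IPV6_FL_F_EXCL;

    return setsockopt(fd, IPPROTO_IPV6, IPV6_FLOWLABEL_MGR,
                      &req, sizeof(req));
}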
Changes in v2:
- wrap in IS_ENABLED(CONFIG_IPV6) to avoid breakage if disabled
Fixes: 59c820b231 ("ipv6: elide flowlabel check if no exclusive leases exist")
Link: https://lore.kernel.org/netdev/MWHPR2201MB1072BCCCFCE779E4094837ACD0329@MWHPR2201MB1072.namprd22.prod.outlook.com/
Reported-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Congyu Liu <liu3101@purdue.edu>
Link: https://lore.kernel.org/r/20220215160037.1976072-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__skb_vlan_pop() needs skb->data to point at the mac_header, while
skb_vlan_tag_present() and skb_vlan_tag_get() don't, because they don't
look at skb->data at all.
So we can avoid uselessly moving around skb->data for the case where the
VLAN tag was offloaded by the DSA master.
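An illustrative sketch of the resulting pattern (not the exact tagger
code; shown only to make the skb->data requirement concrete):

#include <linux/if_vlan.h>
#include <linux/skbuff.h>

/* Sketch: read (and strip) the VLAN tag without touching skb->data when
 * the tag was already popped into skb->vlan_tci by the DSA master. */
static int example_get_vlan(struct sk_buff *skb, u16 *vid)
{
    int err;

    if (skb_vlan_tag_present(skb)) {
        *vid = skb_vlan_tag_get_id(skb);    /* no skb->data access */
        __vlan_hwaccel_clear_tag(skb);
        return 0;
    }

    /* Inline tag: __skb_vlan_pop() wants skb->data at the mac header */
    skb_push(skb, ETH_HLEN);
    err = __skb_vlan_pop(skb, vid);
    skb_pull(skb, ETH_HLEN);
    return err;
}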
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220215204722.2134816-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever the bridge driver hits the max capacity of MDBs, it disables
MC processing (by setting the corresponding bridge option), but never
notifies switchdev about such a change (the notifiers are called only
upon explicit setting of this option, through the registered netlink
interface).
This could lead to a situation where software MDB processing gets
disabled, but this event never gets offloaded to the underlying
hardware.
Fix this by adding a notification in such a case.
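A sketch of the kind of notification that was missing (illustrative;
the actual fix lives in the bridge multicast code, this just mirrors
how such an attribute is normally reported to switchdev):

#include <net/switchdev.h>

/* Sketch: tell offloading drivers that multicast processing toggled. */
static void example_notify_mc_disabled(struct net_device *dev, bool disabled)
{
    struct switchdev_attr attr = {
        .orig_dev = dev,
        .id = SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
        .flags = SWITCHDEV_F_DEFER,
        .u.mc_disabled = disabled,
    };

    switchdev_port_attr_set(dev, &attr, NULL);
}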
Fixes: 147c1e9b90 ("switchdev: bridge: Offload multicast disabled")
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20220215165303.31908-1-oleksandr.mazur@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replacements:
queueing to queuing
trasfer to transfer
aditional to additional
adaptor to adapter
transactino to transaction
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20220215213802.3043178-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When smc_connect_clc() times out, it returns -EAGAIN (tcp_recvmsg
returns -EAGAIN on timeout), and this value is then passed on to the
application, which is quite confusing to applications and inconsistent
with TCP.
According to the connect() manual, ETIMEDOUT is more suitable, so this
patch converts EAGAIN to ETIMEDOUT in that case.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1644913490-21594-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is unused since commit 8e2288cad6 ("net: hns3: refactor PF
cmdq init and uninit APIs with new common APIs").
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Jie Wang <wangjie125@huawei.com>
Link: https://lore.kernel.org/r/20220216113507.22368-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mark C++-specific T::open() and other methods as static inline to avoid
symbol redefinition when multiple files use the same skeleton header in
an application.
Fixes: bb8ffe61ea ("bpftool: Add C++-specific open/load/etc skeleton wrappers")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220216233540.216642-1-andrii@kernel.org
Commit e5043894b2 ("bpftool: Use libbpf_get_error() to check error")
broke dumping a map without BTF loaded in pretty mode (-p option).
Fix this by making sure get_map_kv_btf() won't fail when there's
no BTF available for the map.
Fixes: e5043894b2 ("bpftool: Use libbpf_get_error() to check error")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220216092102.125448-1-jolsa@kernel.org
Sysfs support might be disabled, so we need to guard the code that
instantiates the "compression" attribute with an #ifdef.
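The guard follows the usual CONFIG_SYSFS pattern; a simplified sketch
(the attribute name matches the commit, the value string and the
surrounding wiring are illustrative):

#include <linux/kobject.h>
#include <linux/sysfs.h>

#ifdef CONFIG_SYSFS
/* The "compression" attribute only exists when sysfs is built in. */
static ssize_t compression_show(struct kobject *kobj,
                                struct kobj_attribute *attr, char *buf)
{
    return sysfs_emit(buf, "%s\n", "gzip");    /* value illustrative */
}
static struct kobj_attribute module_compression_attr = __ATTR_RO(compression);
#endif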
Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
When commit e6ac2450d6 ("bpf: Support bpf program calling kernel function")
added kfunc support, it defined reg2btf_ids as a cheap way to translate
the verifier reg type to the appropriate btf_vmlinux BTF ID. However,
commit c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
moved __BPF_REG_TYPE_MAX from the last member of the bpf_reg_type enum to
just after the base register types, and defined other variants using type
flag composition. As a result, directly using reg->type to index into
reg2btf_ids may no longer fall within the __BPF_REG_TYPE_MAX range, which
leads to an out-of-bounds access and a kernel crash on dereference of a
bad pointer.
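A sketch of the defensive indexing this calls for (illustrative, not
necessarily the exact upstream fix), using base_type() to strip modifier
flags such as PTR_MAYBE_NULL before using the value as an index:

    /* reg->type may carry modifier flags; only the base register type
     * is a valid index into reg2btf_ids[]. */
    enum bpf_reg_type base = base_type(reg->type);

    if (base >= __BPF_REG_TYPE_MAX || !reg2btf_ids[base])
        return -EINVAL;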
Fixes: c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220216201943.624869-1-memxor@gmail.com
Mauricio Vásquez <mauricio@kinvolk.io> says:
====================
CO-RE requires BTF information describing the kernel types in order to
perform the relocations. This is usually provided by the kernel itself
when it's configured with CONFIG_DEBUG_INFO_BTF. However, this
configuration is not enabled in all distributions and it's not
available on kernels before 5.12.
It's possible to use CO-RE in kernels without CONFIG_DEBUG_INFO_BTF
support by providing the BTF information from an external source.
BTFHub[0] contains BTF files for each released kernel not supporting
BTF, for the most popular distributions.
Providing this BTF file for a given kernel has some challenges:
1. Each BTF file is a few MBs in size, so it's not possible to ship the
eBPF program with all the BTF files needed to run on different kernels.
(The BTF files would be in the order of GBs if you want to support a
high number of kernels.)
2. Downloading the BTF file for the current kernel at runtime delays the
start of the program, and it's not always possible to reach an external
host to download such a file.
Providing a BTF file with information about all the data types of the
kernel just to run an eBPF program is overkill in many cases. Usually
eBPF programs access only some kernel fields.
This series implements BTFGen support in bpftool. This idea was
discussed during the "Towards truly portable eBPF"[1] presentation at
Linux Plumbers 2021.
There is a good example[2] of how to use BTFGen and BTFHub together
to generate multiple BTF files, one for each existing/supported kernel,
tailored to one application. For example: a complex bpf object might
support nearly 400 kernels with BTF files summing to only 1.5 MB.
[0]: https://github.com/aquasecurity/btfhub/
[1]: https://www.youtube.com/watch?v=igJLKyP1lFk&t=2418s
[2]: https://github.com/aquasecurity/btfhub/tree/main/tools
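For context, consuming such a reduced BTF file from an application is
then a single open-time option; a hedged sketch using libbpf's
btf_custom_path (paths and object names here are illustrative):

#include <bpf/libbpf.h>

/* Sketch: open and load a BPF object against an externally provided
 * (e.g. BTFGen-generated) BTF file instead of the kernel's own BTF. */
int load_with_custom_btf(void)
{
    DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts,
                        .btf_custom_path = "/etc/btfhub/5.4.0-generic.btf");
    struct bpf_object *obj;

    obj = bpf_object__open_file("myprog.bpf.o", &opts);
    if (libbpf_get_error(obj))
        return -1;

    return bpf_object__load(obj);
}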
Changelog:
v6 > v7:
- use array instead of hashmap to store ids
- use btf__add_{struct,union}() instead of memcpy()
- don't use fixed path for testing BTF file
- update example to use DECLARE_LIBBPF_OPTS()
v5 > v6:
- use BTF structure to store used member/types instead of hashmaps
- remove support for input/output folders
- remove bpf_core_{created,free}_cand_cache()
- reorganize commits to avoid having unused static functions
- remove usage of libbpf_get_error()
- fix some errno propagation issues
- do not record full types for type-based relocations
- add support for BTF_KIND_FUNC_PROTO
- implement tests based on core_reloc ones
v4 > v5:
- move some checks before invoking prog->obj->gen_loader
- use p_info() instead of printf()
- improve command output
- fix issue with record_relo_core()
- implement bash completion
- write man page
- implement some tests
v3 > v4:
- parse BTF and BTF.ext sections in bpftool and use
bpf_core_calc_relo_insn() directly
- expose less internal details from libbpf to bpftool
- implement support for enum-based relocations
- split commits in a more granular way
v2 > v3:
- expose internal libbpf APIs to bpftool instead
- implement btfgen in bpftool
- drop btf__raw_data() from libbpf
v1 > v2:
- introduce bpf_object__prepare() and ‘record_core_relos’ to expose
CO-RE relocations instead of bpf_object__reloc_info_gen()
- rename btf__save_to_file() to btf__raw_data()
v1: https://lore.kernel.org/bpf/20211027203727.208847-1-mauricio@kinvolk.io/
v2: https://lore.kernel.org/bpf/20211116164208.164245-1-mauricio@kinvolk.io/
v3: https://lore.kernel.org/bpf/20211217185654.311609-1-mauricio@kinvolk.io/
v4: https://lore.kernel.org/bpf/20220112142709.102423-1-mauricio@kinvolk.io/
v5: https://lore.kernel.org/bpf/20220128223312.1253169-1-mauricio@kinvolk.io/
v6: https://lore.kernel.org/bpf/20220209222646.348365-1-mauricio@kinvolk.io/
Mauricio Vásquez (6):
libbpf: split bpf_core_apply_relo()
libbpf: Expose bpf_core_{add,free}_cands() to bpftool
bpftool: Add gen min_core_btf command
bpftool: Implement "gen min_core_btf" logic
bpftool: Implement btfgen_get_btf()
selftests/bpf: Test "bpftool gen min_core_btf"
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
This commit reuses the core_reloc test to check if the BTF files
generated with "bpftool gen min_core_btf" are correct. This introduces
test_core_btfgen() that runs all the core_reloc tests, but this time
the source BTF files are generated by using "bpftool gen min_core_btf".
The goal of this test is to check that the generated files are usable,
and not to check if the algorithm is creating an optimized BTF file.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-8-mauricio@kinvolk.io
Daniel Gibson reports that the n_tty code gets line termination wrong in
very specific cases:
"If you feed a line with exactly 64 chars + terminating newline, and
directly afterwards (without reading) another line into a pseudo
terminal, the first read() on the other side will return the 64
char line *without* terminating newline, and the next read() will
return the missing terminating newline AND the complete next line (if
it fits in the buffer)"
and bisected the behavior to commit 3b830a9c34 ("tty: convert
tty_ldisc_ops 'read()' function to take a kernel pointer").
Now, digging deeper, it turns out that the behavior isn't exactly new:
what changed in commit 3b830a9c34 was that the tty line discipline
.read() function is now passed an intermediate kernel buffer rather than
the final user space buffer.
And that intermediate kernel buffer is 64 bytes in size - thus that
special case with exactly 64 bytes plus terminating newline.
The same problem did exist before, but historically the boundary was not
the 64-byte chunk, but the user-supplied buffer size, which is obviously
generally bigger (and potentially bigger than N_TTY_BUF_SIZE, which
would hide the issue entirely).
The reason is that the n_tty canon_copy_from_read_buf() code would look
ahead for the EOL character one byte further than it would actually
copy. It would then decide that it had found the terminator, and unmark
it as an EOL character - which in turn explains why the next read
wouldn't then be terminated by it.
Now, the reason it did all this in the first place is related to some
historical and pretty obscure EOF behavior, see commit ac8f3bf883
("n_tty: Fix poll() after buffer-limited eof push read") and commit
40d5e0905a ("n_tty: Fix EOF push handling").
And the reason for the EOL confusion is that we treat EOF as a special
EOL condition, with the EOL character being NUL (aka "__DISABLED_CHAR"
in the kernel sources).
So that EOF look-ahead also affects the normal EOL handling.
This patch just removes the look-ahead that causes problems, because EOL
is much more critical than the historical "EOF in the middle of a line
that coincides with the end of the buffer" handling ever was.
Now, it is possible that we should indeed re-introduce the "look at next
character to see if it's a EOF" behavior, but if so, that should be done
not at the kernel buffer chunk boundary in canon_copy_from_read_buf(),
but at a higher level, when we run out of the user buffer.
In particular, the place to do that would be at the top of
'n_tty_read()', where we check if it's a continuation of a previously
started read; when there is no more buffer space left, we could decide
to just eat the __DISABLED_CHAR at that point.
But that would be a separate patch, because I suspect nobody actually
cares, and I'd like to get a report about it before bothering.
Fixes: 3b830a9c34 ("tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer")
Fixes: ac8f3bf883 ("n_tty: Fix poll() after buffer-limited eof push read")
Fixes: 40d5e0905a ("n_tty: Fix EOF push handling")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215611
Reported-and-tested-by: Daniel Gibson <metalcaedes@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add "min_core_btf" feature explanation and one example of how to use it
to bpftool-gen man page.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-7-mauricio@kinvolk.io
The last part of the BTFGen algorithm is to create a new BTF object with
all the types that were recorded in the previous steps.
This function performs two different steps:
1. Add the types to the new BTF object by using btf__add_type(). Some
special logic around structs and unions is implemented to only add the
members that are really used in the field-based relocations. The mapping
between type IDs in the old and new BTF objects is stored in a map.
2. Fix up all the type IDs in the new BTF object by using the IDs saved
in the previous step.
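Step 1 essentially boils down to copying types with libbpf's
btf__add_type() while remembering how IDs were renumbered; a simplified
sketch (not the exact bpftool code):

#include <bpf/btf.h>

/* Simplified: copy one used type from src_btf into the new BTF object
 * and record how its ID was renumbered, so that references can be fixed
 * up later (step 2). */
static int copy_type(struct btf *new_btf, const struct btf *src_btf,
                     __u32 old_id, __u32 *id_map)
{
    const struct btf_type *t = btf__type_by_id(src_btf, old_id);
    int new_id;

    new_id = btf__add_type(new_btf, src_btf, t);
    if (new_id < 0)
        return new_id;

    id_map[old_id] = new_id;    /* step 2 rewrites references using this */
    return 0;
}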
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-6-mauricio@kinvolk.io
This commit implements the logic for the gen min_core_btf command.
Specifically, it implements the following functions:
- minimize_btf(): receives the paths of the source and destination BTF
files and a list of BPF objects. This function records the relocations
for all objects and then generates the BTF file by calling
btfgen_get_btf() (implemented in the following commit).
- btfgen_record_obj(): loads the BTF and BTF.ext sections of the BPF
objects and loops through all CO-RE relocations. It uses
bpf_core_calc_relo_insn() from libbpf and passes the target spec to
btfgen_record_reloc(), that calls one of the following functions
depending on the relocation kind.
- btfgen_record_field_relo(): uses the target specification to mark all
the types that are involved in a field-based CO-RE relocation. In this
case types are resolved and marked recursively using btfgen_mark_type().
Only the struct and union members (and their types) involved in the
relocation are marked, to optimize the size of the generated BTF file.
- btfgen_record_type_relo(): marks the types involved in a type-based
CO-RE relocation. In this case no members of the struct and union types
are marked, as libbpf doesn't use them while performing this kind of
relocation. Pointed-to types are marked, as they are used by libbpf in
this case.
- btfgen_record_enumval_relo(): marks the whole enum type for enum-based
relocations.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-5-mauricio@kinvolk.io
This command is implemented under the "gen" command in bpftool and the
syntax is the following:
$ bpftool gen min_core_btf INPUT OUTPUT OBJECT [OBJECT...]
INPUT is the file that contains all the BTF types for a kernel and
OUTPUT is the path of the minimized BTF file that will be created,
containing only the types needed by the objects.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-4-mauricio@kinvolk.io
BTFGen needs to run the CO-RE relocation logic in order to understand
which types are involved in a given relocation.
Currently bpf_core_apply_relo() calculates and **applies** a relocation
to an instruction. Having both operations in the same function makes it
difficult to only calculate the relocation without patching the
instruction. This commit splits that logic into two different phases: (1)
calculate the relocation and (2) patch the instruction.
For the first phase bpf_core_apply_relo() is renamed to
bpf_core_calc_relo_insn(), which is now only in charge of calculating the
relocation; the second phase uses the already existing
bpf_core_patch_insn(). bpf_object__relocate_core() uses both of them, and
BTFGen will use only bpf_core_calc_relo_insn().
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-2-mauricio@kinvolk.io
The struct perf_event_attr is initialised differently in Arm64 when
recording in call-graph fp mode, so update the relevant tests, and add
two extra arm64-only tests.
Before:
$ perf test 17 -v
17: Setup struct perf_event_attr
[...]
running './tests/attr/test-record-graph-default'
expected sample_type=295, got 4391
expected sample_regs_user=0, got 1073741824
FAILED './tests/attr/test-record-graph-default' - match failure
test child finished with -1
---- end ----
After:
[...]
running './tests/attr/test-record-graph-default-aarch64'
test limitation 'aarch64'
running './tests/attr/test-record-graph-fp-aarch64'
test limitation 'aarch64'
running './tests/attr/test-record-graph-default'
test limitation '!aarch64'
excluded architecture list ['aarch64']
skipped [aarch64] './tests/attr/test-record-graph-default'
running './tests/attr/test-record-graph-fp'
test limitation '!aarch64'
excluded architecture list ['aarch64']
skipped [aarch64] './tests/attr/test-record-graph-fp'
[...]
Fixes: 7248e308a5 ("perf tools: Record ARM64 LR register automatically")
Signed-off-by: German Gomez <german.gomez@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Truong <alexandre.truong@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: http://lore.kernel.org/lkml/20220125104435.2737-1-german.gomez@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
GCC 12 correctly reports a potential use-after-free condition in the
xrealloc helper. Fix the warning by avoiding an implicit "free(ptr)"
when size == 0:
In file included from help.c:12:
In function 'xrealloc',
inlined from 'add_cmdname' at help.c:24:2: subcmd-util.h:56:23: error: pointer may be used after 'realloc' [-Werror=use-after-free]
56 | ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
subcmd-util.h:52:21: note: call to 'realloc' here
52 | void *ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
subcmd-util.h:58:31: error: pointer may be used after 'realloc' [-Werror=use-after-free]
58 | ret = realloc(ptr, 1);
| ^~~~~~~~~~~~~~~
subcmd-util.h:52:21: note: call to 'realloc' here
52 | void *ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
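The shape of the fix is to stop touching 'ptr' after it has been handed
to realloc(); a sketch (not necessarily character-for-character the
final subcmd-util.h version, die() being the header's existing error
helper):

#include <stdlib.h>

static inline void *xrealloc(void *ptr, size_t size)
{
    /* never reuse 'ptr' after realloc(); keep the "size 0 means 1" quirk */
    void *ret = realloc(ptr, size ? size : 1);

    if (!ret)
        die("Out of memory, realloc failed");
    return ret;
}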
Fixes: 2f4ce5ec1d ("perf tools: Finalize subcmd independence")
Reported-by: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Tested-by: Justin M. Forbes <jforbes@fedoraproject.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-hardening@vger.kernel.org
Cc: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Link: http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tzvetomir Stoyanov reported an issue with using the macro
perf_cpu_map__for_each_cpu with a private perf_cpu object.
The issue is caused by a recent change that wrapped the cpu value in
struct perf_cpu to distinguish it from cpu indexes. We need to make
struct perf_cpu public.
Add a simple test for using the perf_cpu_map__for_each_cpu macro.
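A minimal usage sketch of the macro with the now-public struct perf_cpu
(error handling omitted):

#include <perf/cpumap.h>
#include <stdio.h>

int main(void)
{
    struct perf_cpu_map *cpus = perf_cpu_map__new(NULL);
    struct perf_cpu cpu;
    int idx;

    /* iterate over (index, cpu) pairs of the map */
    perf_cpu_map__for_each_cpu(cpu, idx, cpus)
        printf("idx %d: cpu %d\n", idx, cpu.cpu);

    perf_cpu_map__put(cpus);
    return 0;
}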
Fixes: 6d18804b96 ("perf cpumap: Give CPUs their own type")
Reported-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20220215153713.31395-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
'perf inject' with Coresight data generates files that cannot be opened
when only the last branch option is specified:
perf inject -i perf.data --itrace=l -o inject.data
perf script -i inject.data
0x33faa8 [0x8]: failed to process type: 9 [Bad address]
This is because cs_etm__synth_instruction_sample() is called even when
the sample type for instructions hasn't been setup. Last branch records
are attached to instruction samples so it doesn't make sense to generate
them when --itrace=i isn't specified anyway.
This change disables all calls of cs_etm__synth_instruction_sample()
unless --itrace=i is specified, resulting in a file with no samples if
only --itrace=l is provided, rather than a bad file.
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: James Clark <james.clark@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20220210200620.1227232-2-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
sample_branches and sample_instructions are already saved in the
synth_opts struct. Other usages like synth_opts.last_branch don't save a
value, so make this more consistent by always going through synth_opts
and not saving duplicate values.
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: James Clark <james.clark@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20220210200620.1227232-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit a7f3713f6b ("libperf tests: Add test_stat_multiplexing test")
added printf's of 64-bit ints using %lu which doesn't work on 32-bit
builds:
tests/test-evlist.c:529:29: error: format ‘%lu’ expects argument of type \
‘long unsigned int’, but argument 4 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
Use PRIu64 instead which works on both 32-bit and 64-bit systems.
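For reference, the portable pattern looks like this:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t count = 42;

    /* "%" PRIu64 expands correctly on both 32-bit and 64-bit builds */
    printf("count: %" PRIu64 "\n", count);
    return 0;
}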
Fixes: a7f3713f6b ("libperf tests: Add test_stat_multiplexing test")
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shunsuke Nakamura <nakamura.shun@fujitsu.com>
Link: https://lore.kernel.org/r/20220201213903.699656-1-robh@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the trivial change in:
ddecd22878 ("perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures")
Just adds a comment.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The function trace__symbols_init() runs "perf-read-vdso32", which ends up
with a SIGCHLD delivered to 'perf', and this SIGCHLD makes perf exit early.
'perf trace' should exit only if the SIGCHLD is from our workload process,
so let's use sigaction() instead of signal() so that we can check that
condition.
Committer notes:
Use memset to zero the 'struct sigaction' variable as the '= { 0 }'
method isn't accepted in many compiler versions, e.g.:
4 34.02 alpine:3.6 : FAIL clang version 4.0.0 (tags/RELEASE_400/final)
builtin-trace.c:4897:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
struct sigaction sigchld_act = { 0 };
^
{}
builtin-trace.c:4897:37: error: missing field 'sa_mask' initializer [-Werror,-Wmissing-field-initializers]
struct sigaction sigchld_act = { 0 };
^
2 errors generated.
6 32.60 alpine:3.8 : FAIL gcc version 6.4.0 (Alpine 6.4.0)
builtin-trace.c:4897:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
struct sigaction sigchld_act = { 0 };
^
{}
builtin-trace.c:4897:37: error: missing field 'sa_mask' initializer [-Werror,-Wmissing-field-initializers]
struct sigaction sigchld_act = { 0 };
^
2 errors generated.
7 34.82 alpine:3.9 : FAIL gcc version 8.3.0 (Alpine 8.3.0)
builtin-trace.c:4897:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
struct sigaction sigchld_act = { 0 };
^
{}
builtin-trace.c:4897:37: error: missing field 'sa_mask' initializer [-Werror,-Wmissing-field-initializers]
struct sigaction sigchld_act = { 0 };
^
2 errors generated.
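A simplified sketch of the approach (not the exact builtin-trace.c code;
'workload_pid' is illustrative):

#include <signal.h>
#include <string.h>
#include <sys/types.h>

static pid_t workload_pid;    /* illustrative: pid of the forked workload */

static void sighandler_chld(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)context;

    if (info->si_pid == workload_pid) {
        /* only our workload's exit should make 'perf trace' stop */
    }
}

static void install_sigchld_handler(void)
{
    struct sigaction sigchld_act;

    /* memset instead of '= { 0 }', which older compilers reject */
    memset(&sigchld_act, 0, sizeof(sigchld_act));
    sigchld_act.sa_flags = SA_SIGINFO;
    sigchld_act.sa_sigaction = sighandler_chld;
    sigaction(SIGCHLD, &sigchld_act, NULL);
}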
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220208140725.3947-1-changbin.du@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Vladimir Oltean says:
====================
Replay and offload host VLAN entries in DSA
v2->v3:
- make the bridge stop notifying switchdev for !BRENTRY VLANs
- create precommit and commit wrappers around __vlan_add_flags().
- special-case the BRENTRY transition from false to true, instead of
treating it as a change of flags and letting drivers figure out that
it really isn't.
- avoid setting *changed unless we know that functions will not error
out later.
- drop "old_flags" from struct switchdev_obj_port_vlan, nobody needs it
now, in v2 only DSA needed it to filter out BRENTRY transitions, that
is now solved cleaner.
- no BRIDGE_VLAN_INFO_BRENTRY flag checks and manipulations in DSA
whatsoever, use the "bool changed" bit as-is after changing what it
means.
- merge dsa_slave_host_vlan_{add,del}() with
dsa_slave_foreign_vlan_{add,del}(), since now they do the same thing,
because the host_vlan functions no longer need to mangle the vlan
BRENTRY flags and bool changed.
v1->v2:
- prune switchdev VLAN additions with no actual change differently
- no longer need to revert struct net_bridge_vlan changes on error from
switchdev
- no longer need to first delete a changed VLAN before readding it
- pass 'bool changed' and 'u16 old_flags' through switchdev_obj_port_vlan
so that DSA can do some additional post-processing with the
BRIDGE_VLAN_INFO_BRENTRY flag
- support VLANs on foreign interfaces
- fix the same -EOPNOTSUPP error in mv88e6xxx, this time on removal, due
to VLAN deletion getting replayed earlier than FDB deletion
The motivation behind these patches is that Rafael reported the
following error with mv88e6xxx when the first switch port joins a
bridge:
mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95 (-EOPNOTSUPP)
The FDB entry that's added is the MAC address of the bridge, in VID 1
(the default_pvid), being replayed as part of br_add_if() -> ... ->
nbp_switchdev_sync_objs().
-EOPNOTSUPP is the mv88e6xxx driver's way of saying that VID 1 doesn't
exist in the VTU, so it can't program the ATU with a FID, something
which it needs.
It appears to be a race, but it isn't, since we only end up installing
VID 1 in the VTU by coincidence. DSA's approximation of programming
VLANs on the CPU port together with the user ports breaks down with
host FDB entries on mv88e6xxx, since that strictly requires the VTU to
contain the VID. But the user may freely add VLANs pointing just towards
the bridge, and FDB entries in those VLANs, and DSA will not be aware of
them, because it only listens for VLANs on user ports.
To create a solution that scales properly to cross-chip setups and
doesn't leak entries behind, some changes in the bridge driver are
required. I believe that these are for the better overall, but I may be
wrong. Namely, the same refcounting procedure that DSA has in place for
host FDB and MDB entries can be replicated for VLANs, except that it's
garbage in, garbage out: the VLAN addition and removal notifications
from switchdev aren't balanced. So the first 2 patches attempt to deal
with that.
This patch set has been superficially tested on a board with 3 mv88e6xxx
switches in a daisy chain and appears to produce the primary desired
effect - the driver no longer returns -EOPNOTSUPP when the first port
joins a bridge, and is successful in performing local termination under
a VLAN-aware bridge.
As an additional side effect, it silences the annoying "p%d: already a
member of VLAN %d\n" warning messages that the mv88e6xxx driver produces
when coupled with systemd-networkd, and a few VLANs are configured.
Furthermore, it advances Florian's idea from a few years back, which
never got merged:
https://lore.kernel.org/lkml/20180624153339.13572-1-f.fainelli@gmail.com/
v2 has also been tested on the NXP LS1028A felix switch.
Some testing:
root@debian:~# bridge vlan add dev br0 vid 101 pvid self
[ 100.709220] mv88e6085 d0032004.mdio-mii:10: mv88e6xxx_port_vlan_add: port 9 vlan 101
[ 100.873426] mv88e6085 d0032004.mdio-mii:10: mv88e6xxx_port_vlan_add: port 10 vlan 101
[ 100.892314] mv88e6085 d0032004.mdio-mii:11: mv88e6xxx_port_vlan_add: port 9 vlan 101
[ 101.053392] mv88e6085 d0032004.mdio-mii:11: mv88e6xxx_port_vlan_add: port 10 vlan 101
[ 101.076994] mv88e6085 d0032004.mdio-mii:12: mv88e6xxx_port_vlan_add: port 9 vlan 101
root@debian:~# bridge vlan add dev br0 vid 101 pvid self
root@debian:~# bridge vlan add dev br0 vid 101 pvid self
root@debian:~# bridge vlan
port vlan-id
eth0 1 PVID Egress Untagged
lan9 1 PVID Egress Untagged
lan10 1 PVID Egress Untagged
lan11 1 PVID Egress Untagged
lan12 1 PVID Egress Untagged
lan13 1 PVID Egress Untagged
lan14 1 PVID Egress Untagged
lan15 1 PVID Egress Untagged
lan16 1 PVID Egress Untagged
lan17 1 PVID Egress Untagged
lan18 1 PVID Egress Untagged
lan19 1 PVID Egress Untagged
lan20 1 PVID Egress Untagged
lan21 1 PVID Egress Untagged
lan22 1 PVID Egress Untagged
lan23 1 PVID Egress Untagged
lan24 1 PVID Egress Untagged
sfp 1 PVID Egress Untagged
lan1 1 PVID Egress Untagged
lan2 1 PVID Egress Untagged
lan3 1 PVID Egress Untagged
lan4 1 PVID Egress Untagged
lan5 1 PVID Egress Untagged
lan6 1 PVID Egress Untagged
lan7 1 PVID Egress Untagged
lan8 1 PVID Egress Untagged
br0 1 Egress Untagged
101 PVID
root@debian:~# bridge vlan del dev br0 vid 101 pvid self
[ 108.340487] mv88e6085 d0032004.mdio-mii:11: mv88e6xxx_port_vlan_del: port 9 vlan 101
[ 108.379167] mv88e6085 d0032004.mdio-mii:11: mv88e6xxx_port_vlan_del: port 10 vlan 101
[ 108.402319] mv88e6085 d0032004.mdio-mii:12: mv88e6xxx_port_vlan_del: port 9 vlan 101
[ 108.425866] mv88e6085 d0032004.mdio-mii:10: mv88e6xxx_port_vlan_del: port 9 vlan 101
[ 108.452280] mv88e6085 d0032004.mdio-mii:10: mv88e6xxx_port_vlan_del: port 10 vlan 101
root@debian:~# bridge vlan del dev br0 vid 101 pvid self
root@debian:~# bridge vlan del dev br0 vid 101 pvid self
root@debian:~# bridge vlan
port vlan-id
eth0 1 PVID Egress Untagged
lan9 1 PVID Egress Untagged
lan10 1 PVID Egress Untagged
lan11 1 PVID Egress Untagged
lan12 1 PVID Egress Untagged
lan13 1 PVID Egress Untagged
lan14 1 PVID Egress Untagged
lan15 1 PVID Egress Untagged
lan16 1 PVID Egress Untagged
lan17 1 PVID Egress Untagged
lan18 1 PVID Egress Untagged
lan19 1 PVID Egress Untagged
lan20 1 PVID Egress Untagged
lan21 1 PVID Egress Untagged
lan22 1 PVID Egress Untagged
lan23 1 PVID Egress Untagged
lan24 1 PVID Egress Untagged
sfp 1 PVID Egress Untagged
lan1 1 PVID Egress Untagged
lan2 1 PVID Egress Untagged
lan3 1 PVID Egress Untagged
lan4 1 PVID Egress Untagged
lan5 1 PVID Egress Untagged
lan6 1 PVID Egress Untagged
lan7 1 PVID Egress Untagged
lan8 1 PVID Egress Untagged
br0 1 Egress Untagged
root@debian:~# bridge vlan del dev br0 vid 101 pvid self
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA now explicitly handles VLANs installed with the 'self' flag on the
bridge as host VLANs, instead of just replicating every bridge port VLAN
also on the CPU port and never deleting it, which is what it did before.
However, this leaves a corner case uncovered, as explained by
Tobias Waldekranz:
https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260
Forwarding towards a bridge port VLAN installed on a bridge port foreign
to DSA (separate NIC, Wi-Fi AP) used to work by virtue of the fact that
DSA itself needed to have at least one port in that VLAN (therefore, it
also had the CPU port in said VLAN). However, now that the CPU port may
not be member of all VLANs that user ports are members of, we need to
ensure this isn't the case if software forwarding to a foreign interface
is required.
The solution is to treat bridge port VLANs on standalone interfaces in
the exact same way as host VLANs. From DSA's perspective, there is no
difference between local termination and software forwarding; packets in
that VLAN must reach the CPU in both cases.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
does so on user ports. This is good for basic functionality but has
several limitations:
- the VLAN group which must reach the CPU may be radically different
from the VLAN group that must be autonomously forwarded by the switch.
In other words, the admin may want to isolate noisy stations and avoid
traffic from them going to the control processor of the switch, where
it would just waste useless cycles. The bridge already supports
independent control of VLAN groups on bridge ports and on the bridge
itself, and when VLAN-aware, it will drop packets in software anyway
if their VID isn't added as a 'self' entry towards the bridge device.
- Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
on replaying the host VLANs as well. The 2 VLAN groups are
approximately the same in most regular cases, but there are corner
cases when timing matters, and DSA's approximation of replicating
VLANs on shared ports simply does not work.
- If a user makes the bridge (implicitly the CPU port) join a VLAN by
accident, there is no way for the CPU port to isolate itself from that
noisy VLAN except by rebooting the system. This is because for each
VLAN added on a user port, DSA will add it on shared ports too, but
for each VLAN deletion on a user port, it will remain installed on
shared ports, since DSA has no good indication of whether the VLAN is
still in use or not.
Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
addition and removal events, DSA has a simple and straightforward task
of separating the bridge port VLANs (these have an orig_dev which is a
DSA slave interface, or a LAG interface) from the host VLANs (these have
an orig_dev which is a bridge interface), and to keep a simple reference
count of each VID on each shared port.
Forwarding VLANs must be installed on the bridge ports and on all DSA
ports interconnecting them. We don't have a good view of the exact
topology, so we simply install forwarding VLANs on all DSA ports, which
is what has been done until now.
Host VLANs must be installed primarily on the dedicated CPU port of each
bridge port. More subtly, they must also be installed on upstream-facing
and downstream-facing DSA ports that are connecting the bridge ports and
the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
entry may be absent from VTU) is still addressed even if that switch is
in a cross-chip setup, and it has no local CPU port.
Therefore:
- user ports contain only bridge port (forwarding) VLANs, and no
refcounting is necessary
- DSA ports contain both forwarding and host VLANs. Refcounting is
necessary among these 2 types.
- CPU ports contain only host VLANs. Refcounting is also necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev_handle_port_obj_add() helper is good for replicating a
port object on the lower interfaces of @dev, if that object was emitted
on a bridge, or on a bridge port that is a LAG.
However, drivers that use this helper limit themselves to a box from
which they can no longer intercept port objects notified on neighbor
ports ("foreign interfaces").
One such driver is DSA, where software bridging with foreign interfaces
such as standalone NICs or Wi-Fi APs is an important use case. There, a
VLAN installed on a neighbor bridge port roughly corresponds to a
forwarding VLAN installed on the DSA switch's CPU port.
To support this use case while also making use of the benefits of the
switchdev_handle_* replication helper for port objects, introduce a new
variant of these functions that crawls through the neighbor ports of
@dev, in search of potentially compatible switchdev ports that are
interested in the event.
The strategy is identical to switchdev_handle_fdb_event_to_device():
if @dev wasn't a switchdev interface, then go one step upper, and
recursively call this function on the bridge that this port belongs to.
At the next recursion step, __switchdev_handle_port_obj_add() will
iterate through the bridge's lower interfaces. Among those, some will be
switchdev interfaces, and one will be the original @dev that we came
from. To prevent infinite recursion, we must suppress reentry into the
original @dev, and just call the @add_cb for the switchdev interfaces.
It looks like this:
         br0
        / | \
       /  |  \
      /   |   \
   swp0  swp1  eth0
1. __switchdev_handle_port_obj_add(eth0)
-> check_cb(eth0) returns false
-> eth0 has no lower interfaces
-> eth0's bridge is br0
-> switchdev_lower_dev_find(br0, check_cb, foreign_dev_check_cb))
finds br0
2. __switchdev_handle_port_obj_add(br0)
-> check_cb(br0) returns false
-> netdev_for_each_lower_dev
-> check_cb(swp0) returns true, so we don't skip this interface
3. __switchdev_handle_port_obj_add(swp0)
-> check_cb(swp0) returns true, so we call add_cb(swp0)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(swp1) returns true, so we don't skip this interface
4. __switchdev_handle_port_obj_add(swp1)
-> check_cb(swp1) returns true, so we call add_cb(swp1)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(eth0) returns false, so we skip this interface to
avoid infinite recursion
Note: eth0 could have been a LAG, and we don't want to suppress the
recursion through its lowers if those exist, so when check_cb() returns
false, we still call switchdev_lower_dev_find() to estimate whether
there's anything worth a recursion beneath that LAG. Using check_cb()
and foreign_dev_check_cb(), switchdev_lower_dev_find() not only figures
out whether the lowers of the LAG are switchdev, but also whether they
actively offload the LAG or not (whether the LAG is "foreign" to the
switchdev interface or not).
The port_obj_info->orig_dev is preserved across recursive calls, so
switchdev drivers still know on which device was this notification
originally emitted.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_lower_dev_find() assumes RCU read-side critical section
calling context, since it uses netdev_walk_all_lower_dev_rcu().
Rename it appropriately, in preparation of adding a similar iterator
that assumes writer-side rtnl_mutex protection.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The major user of replayed switchdev objects is DSA, and so far it
hasn't needed information about anything other than bridge port VLANs,
so this is all that br_switchdev_vlan_replay() knows to handle.
DSA has managed to get by through replicating every VLAN addition on a
user port such that the same VLAN is also added on all DSA and CPU
ports, but there is a corner case where this does not work.
The mv88e6xxx DSA driver currently prints this error message as soon as
the first port of a switch joins a bridge:
mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95
where a6:ef:77:c8:5f:3d vid 1 is a local FDB entry corresponding to the
bridge MAC address in the default_pvid.
The -EOPNOTSUPP is returned by mv88e6xxx_port_db_load_purge() because it
tries to map VID 1 to a FID (the ATU is indexed by FID not VID), but
fails to do so. This is because ->port_fdb_add() is called before
->port_vlan_add() for VID 1.
The abridged timeline of the calls is:
br_add_if
-> netdev_master_upper_dev_link
   -> dsa_port_bridge_join
      -> switchdev_bridge_port_offload
         -> br_switchdev_vlan_replay (*)
         -> br_switchdev_fdb_replay
            -> mv88e6xxx_port_fdb_add
-> nbp_vlan_init
   -> nbp_vlan_add
      -> mv88e6xxx_port_vlan_add
and the issue is that at the time of (*), the bridge port isn't in VID 1
(nbp_vlan_init hasn't been called), therefore br_switchdev_vlan_replay()
won't have anything to replay, therefore VID 1 won't be in the VTU by
the time mv88e6xxx_port_fdb_add() is called.
This happens only when the first port of a switch joins. For further
ports, the initial mv88e6xxx_port_vlan_add() is sufficient for VID 1 to
be loaded in the VTU (which is switch-wide, not per port).
The problem is somewhat unique to mv88e6xxx by chance, because most
other drivers offload an FDB entry by VID, so FDBs and VLANs can be
added asynchronously with respect to each other, but addressing the
issue at the bridge layer makes sense, since what mv88e6xxx requires
isn't absurd.
To fix this problem, we need to recognize that it isn't the VLAN group
of the port that we're interested in, but the VLAN group of the bridge
itself (so it isn't a timing issue, but rather insufficient information
being passed from switchdev to drivers).
As mentioned, currently nbp_switchdev_sync_objs() only calls
br_switchdev_vlan_replay() for VLANs corresponding to the port, but the
VLANs corresponding to the bridge itself, for local termination, also
need to be replayed. In this case, VID 1 is not (yet) present in the
port's VLAN group but is present in the bridge's VLAN group.
So to fix this bug, DSA is now obligated to explicitly handle VLANs
pointing towards the bridge in order to "close this race" (which isn't
really a race). As Tobias Waldekranz notices, this also implies that it
must explicitly handle port VLANs on foreign interfaces, something that
worked implicitly before:
https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260
So in the end, br_switchdev_vlan_replay() must replay all VLANs from all
VLAN groups: all the ports, and the bridge itself.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There may be switchdev drivers that can add/remove a FDB or MDB entry
only as long as the VLAN it's in has been notified and offloaded first.
The nbp_switchdev_sync_objs() method satisfies this requirement on
addition, but nbp_switchdev_unsync_objs() first deletes VLANs, then
deletes MDBs and FDBs. Reverse the order of the function calls to cater
to this requirement.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD
event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases:
- a struct net_bridge_vlan got created
- an existing struct net_bridge_vlan was modified
This makes it impossible for switchdev drivers to properly balance
PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to
happen, we must provide a way for drivers to distinguish between a
VLAN with changed flags and a new one.
Annotate struct switchdev_obj_port_vlan with a "bool changed" that
distinguishes the 2 cases above.
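The object passed to drivers then looks roughly like this (paraphrased;
see include/net/switchdev.h for the authoritative definition):

struct switchdev_obj_port_vlan {
    struct switchdev_obj obj;
    u16 flags;
    u16 vid;
    /* If set, the VLAN already existed and only its flags/pvid changed;
     * if clear, a new struct net_bridge_vlan was created. */
    bool changed;
};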
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when a VLAN entry is added multiple times in a row to a
bridge port, nbp_vlan_add() calls br_switchdev_port_vlan_add() each
time, even if the VLAN already exists and nothing about it has changed:
bridge vlan add dev lan12 vid 100 master static
Similarly, when a VLAN is added multiple times in a row to a bridge,
br_vlan_add_existing() doesn't filter at all the calls to
br_switchdev_port_vlan_add():
bridge vlan add dev br0 vid 100 self
This behavior makes driver-level accounting of VLANs impossible, since
it is enough for a single deletion event to remove a VLAN, but the
addition event can be emitted an unlimited number of times.
The cause for this can be identified as follows: we rely on
__vlan_add_flags() to retroactively tell us whether it has changed
anything about the VLAN flags or VLAN group pvid. So we'd first have to
call __vlan_add_flags() before calling br_switchdev_port_vlan_add(), in
order to have access to the "bool *changed" information. But we don't
want to change the event ordering, because we'd have to revert the
struct net_bridge_vlan changes we've made if switchdev returns an error.
So to solve this, we need another function that tells us whether any
change is going to occur in the VLAN or VLAN group, _prior_ to calling
__vlan_add_flags().
Split __vlan_add_flags() into a precommit and a commit stage, and rename
it to __vlan_flags_update(). The precommit stage,
__vlan_flags_would_change(), will determine whether there is any reason
to notify switchdev due to a change of flags (note: the BRENTRY flag
transition from false to true is treated separately: as a new switchdev
entry, because we skipped notifying the master VLAN when it wasn't a
brentry yet, and therefore not as a change of flags).
With this lookahead/precommit function in place, we can avoid notifying
switchdev if nothing changed for the VLAN and VLAN group.
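The general shape of that precommit/commit split, as a self-contained
toy illustration (names and the toy flags field are not the bridge
code):

#include <stdbool.h>
#include <stdio.h>

struct entry { unsigned int flags; };

/* precommit: read-only check, safe to call before notifying anyone */
static bool flags_would_change(const struct entry *e, unsigned int new_flags)
{
    return e->flags != new_flags;
}

/* commit: actually apply the change and report whether anything changed */
static void flags_update(struct entry *e, unsigned int new_flags, bool *changed)
{
    *changed = e->flags != new_flags;
    e->flags = new_flags;
}

int main(void)
{
    struct entry e = { .flags = 1 };
    bool changed;

    if (flags_would_change(&e, 3)) {
        /* notify switchdev here; bail out on error before mutating */
    }
    flags_update(&e, 3, &changed);
    printf("changed=%d\n", changed);
    return 0;
}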
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently there is a very subtle aspect to the behavior of
__vlan_add_flags(): it changes the struct net_bridge_vlan flags and
pvid, yet it returns true ("changed") even if none of those changed and
merely a transition of br_vlan_is_brentry(v) from false to true took
place.
This can be seen in br_vlan_add_existing(); however, we do not actually
rely on this subtle behavior, since the "if" condition that checks that
the vlan wasn't a brentry before had a useless (until now) assignment:
*changed = true;
Make things more obvious by actually making __vlan_add_flags() do what's
written on the box, and be more specific about what is actually written
on the box. This is needed because further transformations will be done
to __vlan_add_flags().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VLAN is added to a bridge port and it doesn't exist on the bridge
device yet, it gets created for the multicast context, but it is
'hidden', since it doesn't have the BRENTRY flag yet:
ip link add br0 type bridge && ip link set swp0 master br0
bridge vlan add dev swp0 vid 100 # the master VLAN 100 gets created
bridge vlan add dev br0 vid 100 self # that VLAN becomes brentry just now
All switchdev drivers ignore switchdev notifiers for VLAN entries which
have the BRENTRY flag unset, and for good reason: these are merely private
data structures used by the bridge driver. So we might just as well not
notify those at all.
Cleanup in the switchdev drivers that check for the BRENTRY flag is now
possible, and will be handled separately, since those checks just became
dead code.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VLAN is added to a bridge port, a master VLAN gets created on the
bridge for context, but it doesn't have the BRENTRY flag.
Then, when the same VLAN is added to the bridge itself, that enters
through the br_vlan_add_existing() code path and gains the BRENTRY flag,
thus it becomes "existing".
It seems natural to check for this condition early, because the current
code flow is to notify switchdev of the addition of a VLAN that isn't a
brentry, just to delete it immediately afterwards.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a5886ef4f4 ("gve: Introduce per netdev `enum gve_queue_format`")
introduced three queue format types; only the GVE_GQI_QPL_FORMAT queue
has a page list. So the queue page list number should be used to detect
a zero-size queue page list. Correct the design logic.
Using 'queue_format == GVE_GQI_RDA_FORMAT' may lead to requesting a
zero-sized memory allocation, e.g. if the queue format is
GVE_DQO_RDA_FORMAT. The kernel memory subsystem will return
ZERO_SIZE_PTR, which is not a NULL address, so the driver can run
successfully. Also, the code still checks the queue page list number
first and only then accesses the allocated memory, so a zero-count queue
page list allocation will not lead to an access fault.
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Bailey Forrest <bcf@google.com>
Link: https://lore.kernel.org/r/20220215051751.260866-1-haiyue.wang@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Read HW interrupt pending state from the HW
x86:
- Don't truncate the performance event mask on AMD
- Fix Xen runstate updates to be atomic when preempting vCPU
- Fix for AMD AVIC interrupt injection race
- Several other AMD fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
KVM: SVM: fix race between interrupt delivery and AVIC inhibition
KVM: SVM: set IRR in svm_deliver_interrupt
KVM: SVM: extract avic_ring_doorbell
selftests: kvm: Remove absent target file
KVM: arm64: vgic: Read HW interrupt pending state from the HW
KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
KVM: x86: nSVM: expose clean bit support to the guest
KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
KVM: x86: nSVM: fix potential NULL derefernce on nested migration
KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
Revert "svm: Add warning message for AVIC IPI invalid target"
Pull HID fixes from Jiri Kosina:
- memory leak fix for hid-elo driver (Dongliang Mu)
- fix for hangs on newer AMD platforms with amd_sfh-driven hardware
(Basavaraj Natikar)
- locking fix in i2c-hid (Daniel Thompson)
- a few device-ID specific quirks
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: amd_sfh: Add interrupt handler to process interrupts
HID: amd_sfh: Add functionality to clear interrupts
HID: amd_sfh: Disable the interrupt for all command
HID: amd_sfh: Correct the structure field name
HID: amd_sfh: Handle amd_sfh work buffer in PM ops
HID:Add support for UGTABLET WP5540
HID: amd_sfh: Add illuminance mask to limit ALS max value
HID: amd_sfh: Increase sensor command timeout
HID: i2c-hid: goodix: Fix a lockdep splat
HID: elo: fix memory leak in elo_probe
HID: apple: Set the tilde quirk flag on the Wellspring 5 and later
bpf_msg_push_data may return a non-zero value to indicate an error. The
return value should be checked to prevent undetected errors.
To indicate an error, the BPF programs now perform a different action
than their intended one so that the userspace test program notices the
error, i.e., the programs supposed to pass/redirect now drop, and the
program supposed to drop now passes.
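A simplified sketch of the pattern the programs now follow (not the
exact test_sockmap BPF code):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_msg")
int prog_msg_verdict(struct sk_msg_md *msg)
{
    /* program meant to pass: drop on helper error so userspace notices */
    if (bpf_msg_push_data(msg, 0, 8, 0))
        return SK_DROP;
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";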
Fixes: 84fbfe026a ("bpf: test_sockmap add options to use msg_push_data")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/89f767bb44005d6b4dd1f42038c438f76b3ebfad.1644601294.git.fmaurer@redhat.com
Now a kfunc call uses s32 to represent the offset between the address of
the kfunc and __bpf_call_base, but it doesn't check whether that s32 will
overflow. The overflow is possible when the kfunc is in a module and the
offset between the module and the kernel is greater than 2GB. Take arm64
as an example: before commit b2eed9b588 ("arm64/kernel: kaslr: reduce
module randomization range to 2 GB"), the offset between a module symbol
and __bpf_call_base could be in a 4GB range due to KASLR and may
overflow s32.
So add an extra check to reject these invalid kfunc calls.
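Conceptually, the added check looks like this (a sketch; the real one
sits in the verifier where the kfunc call descriptor's imm is computed):

    /* Sketch: the call's imm is s32, so the distance from
     * __bpf_call_base must fit in 32 bits or the kfunc call is rejected. */
    unsigned long call_imm = (unsigned long)BPF_CALL_IMM(addr);

    if ((unsigned long)(s32)call_imm != call_imm)
        return -EINVAL;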
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220215065732.3179408-1-houtao1@huawei.com
Andrii Nakryiko says:
====================
Add minimal C++-specific additions to BPF skeleton codegen to facilitate
easier use of C skeletons in C++ applications. These additions don't add
any extra ongoing maintenance and allow C++ users to fit a pure C
skeleton better into their C++ code base. All that without the need to
design, implement and support a separate C++ BPF skeleton implementation.
v1->v2:
- use default argument values in T::open() (Alexei).
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add an example of how to build a C++ template-based BPF skeleton wrapper.
It's an actually runnable, valid use of the skeleton through a more
C++-like interface. Note that skeleton destruction happens implicitly
through Skeleton<T>'s destructor.
Also make test_cpp runnable, as it would have crashed on invalid btf
passed into btf_dump__new().
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220212055733.539056-3-andrii@kernel.org
Add C++-specific static methods for code-generated BPF skeleton for each
skeleton operation: open, open_opts, open_and_load, load, attach,
detach, destroy, and elf_bytes. This is to facilitate easier C++
templating on top of pure C BPF skeleton.
In C, open/load/destroy/etc "methods" are of the form
<skeleton_name>__<method>() to avoid name collisions with similar
"methods" of other skeletons within the same application. This works
well, but is very inconvenient for C++ applications that would like to
write generic (templated) wrappers around the BPF skeleton to fit in
with the C++ code base and take advantage of destructors and other
convenient C++ constructs.
This patch makes it easier to build such generic templated wrappers by
additionally defining C++ static methods for the skeleton's struct, with
fixed names. This allows referring to, say, the open method as
`T::open()` instead of having to somehow generate a `T__open()` function
call.
The next patch adds an example template to the test_cpp selftest to
demonstrate how it's possible to have all the operations wrapped in a
generic Skeleton<my_skeleton> type without explicitly passing function
references.
An example of generated declaration section without %1$s placeholders:
#ifdef __cplusplus
static struct test_attach_probe *open(const struct bpf_object_open_opts *opts = nullptr);
static struct test_attach_probe *open_and_load();
static int load(struct test_attach_probe *skel);
static int attach(struct test_attach_probe *skel);
static void detach(struct test_attach_probe *skel);
static void destroy(struct test_attach_probe *skel);
static const void *elf_bytes(size_t *sz);
#endif /* __cplusplus */
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220212055733.539056-2-andrii@kernel.org
When compiling selftests in -O2 mode with GCC11, we get three new
compilation warnings about potentially uninitialized variables.
The compiler is wrong 2 out of 3 times, but this patch makes GCC11 happy
anyway, as it doesn't cost us anything and makes optimized selftests
builds less annoying.
The amazing one is the tc_redirect case of token that is malloc()'ed
before the ASSERT_OK_PTR() check is done on it. It seems GCC
pessimistically assumes that libbpf_get_error() will dereference the
contents of the pointer (no, it won't), so the only way I found to shut
GCC up was to do a zero-initializing calloc(). This one was new to me.
For the linfo case, GCC didn't realize that linfo_size will be
initialized by the function that returns linfo_size as an out parameter.
The core_reloc.c case was a real bug: we could goto cleanup before
initializing obj. But we don't need to do any cleanup there, so just
continue the iteration instead.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211190927.1434329-1-andrii@kernel.org
Merge tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- yield CPU more often when defragmenting a large file
- skip defragmenting extents already under writeback
- improve error message when send fails to write file data
- get rid of warning when mounted with 'flushoncommit'
* tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: in case of IO error log it
btrfs: get rid of warning on transaction commit when using flushoncommit
btrfs: defrag: don't try to defrag extents which are under writeback
btrfs: don't hold CPU for too long when defragging a file