mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-12-13 05:54:23 +08:00
68f66f97c5
6259 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Masami Hiramatsu (Google)
|
0add699ad0 |
tracing: probes: Fix to zero initialize a local variable
Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.
Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
Fixes:
|
||
Andrii Nakryiko
|
68ca5d4eeb |
bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs
Wire up BPF cookie for raw tracepoint programs (both BTF and non-BTF aware variants). This brings them up to part w.r.t. BPF cookie usage with classic tracepoint and fentry/fexit programs. Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-4-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Andrii Nakryiko
|
d4dfc5700e |
bpf: pass whole link instead of prog when triggering raw tracepoint
Instead of passing prog as an argument to bpf_trace_runX() helpers, that are called from tracepoint triggering calls, store BPF link itself (struct bpf_raw_tp_link for raw tracepoints). This will allow to pass extra information like BPF cookie into raw tracepoint registration. Instead of replacing `struct bpf_prog *prog = __data;` with corresponding `struct bpf_raw_tp_link *link = __data;` assignment in `__bpf_trace_##call` I just passed `__data` through into underlying bpf_trace_runX() call. This works well because we implicitly cast `void *`, and it also avoids naming clashes with arguments coming from tracepoint's "proto" list. We could have run into the same problem with "prog", we just happened to not have a tracepoint that has "prog" input argument. We are less lucky with "link", as there are tracepoints using "link" argument name already. So instead of trying to avoid naming conflicts, let's just remove intermediate local variable. It doesn't hurt readibility, it's either way a bit of a maze of calls and macros, that requires careful reading. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-3-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Andrii Nakryiko
|
6b9c2950c9 |
bpf: flatten bpf_probe_register call chain
bpf_probe_register() and __bpf_probe_register() have identical signatures and bpf_probe_register() just redirect to __bpf_probe_register(). So get rid of this extra function call step to simplify following the source code. It has no difference at runtime due to inlining, of course. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240319233852.1977493-2-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
eb166e522c |
bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed in tracing progs. We have an internal use case where for an application running in a container (with pid namespace), user wants to get the pid associated with the pid namespace in a cgroup bpf program. Currently, cgroup bpf progs already allow bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid() as well. With auditing the code, bpf_get_current_pid_tgid() is also used by sk_msg prog. But there are no side effect to expose these two helpers to all prog types since they do not reveal any kernel specific data. The detailed discussion is in [1]. So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid() are put in bpf_base_func_proto(), making them available to all program types. [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev |
||
Thorsten Blum
|
d6cb38e108 |
tracing: Use div64_u64() instead of do_div()
Fixes Coccinelle/coccicheck warnings reported by do_div.cocci. Compared to do_div(), div64_u64() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lore.kernel.org/linux-trace-kernel/20240225164507.232942-2-thorsten.blum@toblux.com Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Huang Yiwei
|
19f0423fd5 |
tracing: Support to dump instance traces by ftrace_dump_on_oops
Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the tracing instance matching <instance_name> - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu], <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com Cc: Ross Zwisler <zwisler@google.com> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mcgrof@kernel.org> Cc: <keescook@chromium.org> Cc: <j.granados@samsung.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <corbet@lwn.net> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Randy Dunlap
|
d15304135c |
ftrace: Fix most kernel-doc warnings
Reduce the number of kernel-doc warnings from 52 down to 10, i.e., fix 42 kernel-doc warnings by (a) using the Returns: format for function return values or (b) using "@var:" instead of "@var -" for function parameter descriptions. Fix one return values list so that it is formatted correctly when rendered for output. Spell "non-zero" with a hyphen in several places. Link: https://lore.kernel.org/linux-trace-kernel/20240223054833.15471-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/oe-kbuild-all/202312180518.X6fRyDSN-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
2048fdc275 |
tracing: Decrement the snapshot if the snapshot trigger fails to register
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a snapshot trigger was added, even if that snapshot trigger failed. # cd /sys/kernel/tracing # echo "snapshot" > events/sched/sched_process_fork/trigger # echo "snapshot" > events/sched/sched_process_fork/trigger -bash: echo: write error: File exists That second one that fails increments the snapshot counter but doesn't decrement it. It needs to be decremented when the snapshot fails. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.729055907@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
cca990c7b5 |
tracing: Fix snapshot counter going between two tracers that use it
Running the ftrace selftests caused the ring buffer mapping test to fail. Investigating, I found that the snapshot counter would be incremented every time a tracer that uses the snapshot is enabled even if the snapshot was used by the previous tracer. That is: # cd /sys/kernel/tracing # echo wakeup_rt > current_tracer # echo wakeup_dl > current_tracer # echo nop > current_tracer would leave the snapshot counter at 1 and not zero. That's because the enabling of wakeup_dl would increment the counter again but the setting the tracer to nop would only decrement it once. Do not arm the snapshot for a tracer if the previous tracer already had it armed. Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.570525723@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
John Garry
|
ed89683763 |
tracing: Use init_utsname()->release
Instead of using UTS_RELEASE, use init_utsname()->release, which means that we don't need to rebuild the code just for the git head commit changing. Link: https://lore.kernel.org/linux-trace-kernel/20240222124639.65629-1-john.g.garry@oracle.com Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
64805e4039 |
tracing/user_events: Introduce multi-format events
Currently user_events supports 1 event with the same name and must have the exact same format when referenced by multiple programs. This opens an opportunity for malicious or poorly thought through programs to create events that others use with different formats. Another scenario is user programs wishing to use the same event name but add more fields later when the software updates. Various versions of a program may be running side-by-side, which is prevented by the current single format requirement. Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates the user program wishes to use the same user_event name, but may have several different formats of the event. When this flag is used, create the underlying tracepoint backing the user_event with a unique name per-version of the format. It's important that existing ABI users do not get this logic automatically, even if one of the multi format events matches the format. This ensures existing programs that create events and assume the tracepoint name will match exactly continue to work as expected. Add logic to only check multi-format events with other multi-format events and single-format events to only check single-format events during find. Change system name of the multi-format event tracepoint to ensure that multi-format events are isolated completely from single-format events. This prevents single-format names from conflicting with multi-format events if they end with the same suffix as the multi-format events. Add a register_name (reg_name) to the user_event struct which allows for split naming of events. We now have the name that was used to register within user_events as well as the unique name for the tracepoint. Upon registering events ensure matches based on first the reg_name, followed by the fields and format of the event. This allows for multiple events with the same registered name to have different formats. The underlying tracepoint will have a unique name in the format of {reg_name}.{unique_id}. For example, if both "test u32 value" and "test u64 value" are used with the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique tracepoints. The dynamic_events file would then show the following: u:test u64 count u:test u32 count The actual tracepoint names look like this: test.0 test.1 Both would be under the new user_events_multi system name to prevent the older ABI from being used to squat on multi-formatted events and block their use. Deleting events via "!u:test u64 count" would only delete the first tracepoint that matched that format. When the delete ABI is used all events with the same name will be attempted to be deleted. If per-version deletion is required, user programs should either not use persistent events or delete them via dynamic_events. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-3-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
1e953de9e9 |
tracing/user_events: Prepare find/delete for same name events
The current code for finding and deleting events assumes that there will never be cases when user_events are registered with the same name, but different formats. Scenarios exist where programs want to use the same name but have different formats. An example is multiple versions of a program running side-by-side using the same event name, but with updated formats in each version. This change does not yet allow for multi-format events. If user_events are registered with the same name but different arguments the programs see the same return values as before. This change simply makes it possible to easily accommodate for this. Update find_user_event() to take in argument parameters and register flags to accommodate future multi-format event scenarios. Have find validate argument matching and return error pointers to cover when an existing event has the same name but different format. Update callers to handle error pointer logic. Move delete_user_event() to use hash walking directly now that find_user_event() has changed. Delete all events found that match the register name, stop if an error occurs and report back to the user. Update user_fields_match() to cover list_empty() scenarios now that find_user_event() uses it directly. This makes the logic consistent across several callsites. Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Vincent Donnefort
|
180e4e3909 |
tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or ring-buffer swap is possible. This means the snapshot feature is mutually exclusive with the memory mapping. Having a refcount on snapshot users will help to know if a mapping is possible or not. Instead of relying on the global trace_types_lock, a new spinlock is introduced to serialize accesses to trace_array->snapshot. This intends to allow access to that variable in a context where the mmap lock is already held. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-4-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b70f293824 |
ring-buffer: Make wake once of ring_buffer_wait() more robust
The default behavior of ring_buffer_wait() when passed a NULL "cond"
parameter is to exit the function the first time it is woken up. The
current implementation uses a counter that starts at zero and when it is
greater than one it exits the wait_event_interruptible().
But this relies on the internal working of wait_event_interruptible() as
that code basically has:
if (cond)
return;
prepare_to_wait();
if (!cond)
schedule();
finish_wait();
That is, cond is called twice before it sleeps. The default cond of
ring_buffer_wait() needs to account for that and wait for its counter to
increment twice before exiting.
Instead, use the seq/atomic_inc logic that is used by the tracing code
that calls this function. Add an atomic_t seq to rb_irq_work and when cond
is NULL, have the default callback take a descriptor as its data that
holds the rbwork and the value of the seq when it started.
The wakeups will now increment the rbwork->seq and the cond callback will
simply check if that number is different, and no longer have to rely on
the implementation of wait_event_interruptible().
Link: https://lore.kernel.org/linux-trace-kernel/20240315063115.6cb5d205@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
linke li
|
f1e30cb636 |
ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
In function ring_buffer_iter_empty(), cpu_buffer->commit_page is read while other threads may change it. It may cause the time_stamp that read in the next line come from a different page. Use READ_ONCE() to avoid having to reason about compiler optimizations now and in future. Link: https://lore.kernel.org/linux-trace-kernel/tencent_DFF7D3561A0686B5E8FC079150A02505180A@qq.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Vincent Donnefort
|
6b76323e5a |
ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations. Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-2-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
2cc621fd2e |
tracing: Move saved_cmdline code into trace_sched_switch.c
The code that handles saved_cmdlines is split between the trace.c file and the trace_sched_switch.c. There's some history to this. The trace_sched_switch.c was originally created to handle the sched_switch tracer that was deprecated due to sched_switch trace event making it obsolete. But that file did not get deleted as it had some code to help with saved_cmdlines. But trace.c has grown tremendously since then. Just move all the saved_cmdlines code into trace_sched_switch.c as that's the only reason that file still exists, and trace.c has gotten too big. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.497966629@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
e85d471c2b |
tracing: Move open coded processing of tgid_map into helper function
In preparation of moving the saved_cmdlines logic out of trace.c and into trace_sched_switch.c, replace the open coded manipulation of tgid_map in set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it can be easily moved into trace_sched_switch.c without changing existing functions in trace.c. No functional changes. Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.338116216@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
0b18c852cc |
tracing: Have saved_cmdlines arrays all in one allocation
The saved_cmdlines have three arrays for mapping PIDs to COMMs:
- map_pid_to_cmdline[]
- map_cmdline_to_pid[]
- saved_cmdlines
The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index
into the other arrays. The map_cmdline_to_pid[] is a mapping back to the
full pid as it can be larger than PID_MAX_DEFAULT. And the
saved_cmdlines[] just holds the COMMs associated to the pids.
Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated
together (in reality the saved_cmdlines is just in the memory of the
rounding of the allocation of the structure as it is always allocated in
powers of two). The map_cmdline_to_pid[] array is allocated separately.
Since the rounding to a power of two is rather large (it allows for 8000
elements in saved_cmdlines), also include the map_cmdline_to_pid[] array.
(This drops it to 6000 by default, which is still plenty for most use
cases). This saves even more memory as the map_cmdline_to_pid[] array
doesn't need to be allocated.
Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes:
|
||
Linus Torvalds
|
63bd30f249 |
Tracing/ring-buffer fixes for 6.8 (to be applied in 6.9-rc):
- Do not update shortest_full in rb_watermark_hit() if the watermark is hit. The shortest_full field was being updated regardless if the task was going to wait or not. If the watermark is hit, then the task is not going to wait, so do not update the shortest_full field (used by the waker). - Update shortest_full field before setting the full_waiters_pending flag In the poll logic, the full_waiters_pending flag was being set before the shortest_full field was set. If the full_waiters_pending flag is set, writers will check the shortest_full field which has the least percentage of data that the ring buffer needs to be filled before waking up. The writer will check shortest_full if full_waiters_pending is set, and if the ring buffer percentage filled is greater than shortest full, then it will call the irq_work to wake up the waiters. The problem was that the poll logic set the full_waiters_pending flag before updating shortest_full, which when zero will always trigger the writer to call the irq_work to wake up the waiters. The irq_work will reset the shortest_full field back to zero as the woken waiters is suppose to reset it. - There's some optimized logic in the rb_watermark_hit() that is used in ring_buffer_wait(). Use that helper function in the poll logic as well. - Restructure ring_buffer_wait() to use wait_event_interruptible() The logic to wake up pending readers when the file descriptor is closed is racy. Restructure ring_buffer_wait() to allow callers to pass in conditions besides the ring buffer having enough data in it by using wait_event_interruptible(). - Update the tracing_wait_on_pipe() to call ring_buffer_wait() with its own conditions to exit the wait loop. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfH6MRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qtlwAP9ZoSIkvw2MVu7FclgAguaX2CaylGEw sv0wZaCy1kgAPgD8CFhezZcHrt/RwJibpMxVnUs+DDqYnGdJsHYLihlbWgg= =99FG -----END PGP SIGNATURE----- Merge tag 'trace-ring-buffer-v6.8-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Do not update shortest_full in rb_watermark_hit() if the watermark is hit. The shortest_full field was being updated regardless if the task was going to wait or not. If the watermark is hit, then the task is not going to wait, so do not update the shortest_full field (used by the waker). - Update shortest_full field before setting the full_waiters_pending flag In the poll logic, the full_waiters_pending flag was being set before the shortest_full field was set. If the full_waiters_pending flag is set, writers will check the shortest_full field which has the least percentage of data that the ring buffer needs to be filled before waking up. The writer will check shortest_full if full_waiters_pending is set, and if the ring buffer percentage filled is greater than shortest full, then it will call the irq_work to wake up the waiters. The problem was that the poll logic set the full_waiters_pending flag before updating shortest_full, which when zero will always trigger the writer to call the irq_work to wake up the waiters. The irq_work will reset the shortest_full field back to zero as the woken waiters is suppose to reset it. - There's some optimized logic in the rb_watermark_hit() that is used in ring_buffer_wait(). Use that helper function in the poll logic as well. - Restructure ring_buffer_wait() to use wait_event_interruptible() The logic to wake up pending readers when the file descriptor is closed is racy. Restructure ring_buffer_wait() to allow callers to pass in conditions besides the ring buffer having enough data in it by using wait_event_interruptible(). - Update the tracing_wait_on_pipe() to call ring_buffer_wait() with its own conditions to exit the wait loop. * tag 'trace-ring-buffer-v6.8-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/ring-buffer: Fix wait_on_pipe() race ring-buffer: Use wait_event_interruptible() in ring_buffer_wait() ring-buffer: Reuse rb_watermark_hit() for the poll logic ring-buffer: Fix full_waiters_pending in poll ring-buffer: Do not set shortest_full when full target is hit |
||
Linus Torvalds
|
01732755ee |
Probes updates for v6.9:
- x96/kprobes: Use boolean for some function return instead of 0 and 1. - x86/kprobes: Prohibit probing on INT/UD. This prevents user to put kprobe on INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a special purpose in the kernel. - x86/kprobes: Boost Grp instructions. Because a few percent of kernel instructions are Grp 2/3/4/5 and those are safe to be executed without ip register fixup, allow those to be boosted (direct execution on the trampoline buffer with a JMP). - tracing/probes: Add function argument access from return events (kretprobe and fprobe). This allows user to compare how a data structure field is changed after executing a function. With BTF, return event also accepts function argument access by name. This also includes below patches; . Fix a wrong comment (using "Kretprobe" in fprobe) . Cleanup a big probe argument parser function into three parts, type parser, post-processing function, and main parser. . Cleanup to set nr_args field when initializing trace_probe instead of counting up it while parsing. . Cleanup a redundant #else block from tracefs/README source code. . Update selftests to check entry argument access from return probes. . Documentation update about entry argument access from return probes. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXwW4kbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bH80H/3H6JENlDAjaSLi4vYrP Qyw/cOGIuGu8cDEzkkOaFMol3TY23M7tQZH1lFefvV92gebZ0ttXnrQhSsKeO5XT PCZ6Eoift5rwJCY967W4V6O0DrAkOGHlPtlKs47APJnTXwn8RcFTqWlQmhWg1AfD g/FCWV7cs3eewZgV9iQcLydOoLLgRMr3G3rtPYQbCXhPzze0WTu4dSOXxCTjFe04 riHQy7R+ut6Cur8njpoqZl6bCMkQqAylByXf6wK96HjcS0+ZI7Ivi8Ey3l2aAFen EeIViMU2Bl02XzBszj7Xq2cT/ebYAgDonFW3/5ZKD1YMO6F7wPoVH5OHrQ518Xuw hQ8= =O6l5 -----END PGP SIGNATURE----- Merge tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: "x86 kprobes: - Use boolean for some function return instead of 0 and 1 - Prohibit probing on INT/UD. This prevents user to put kprobe on INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a special purpose in the kernel - Boost Grp instructions. Because a few percent of kernel instructions are Grp 2/3/4/5 and those are safe to be executed without ip register fixup, allow those to be boosted (direct execution on the trampoline buffer with a JMP) tracing: - Add function argument access from return events (kretprobe and fprobe). This allows user to compare how a data structure field is changed after executing a function. With BTF, return event also accepts function argument access by name. - Fix a wrong comment (using "Kretprobe" in fprobe) - Cleanup a big probe argument parser function into three parts, type parser, post-processing function, and main parser - Cleanup to set nr_args field when initializing trace_probe instead of counting up it while parsing - Cleanup a redundant #else block from tracefs/README source code - Update selftests to check entry argument access from return probes - Documentation update about entry argument access from return probes" * tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: Documentation: tracing: Add entry argument access at function exit selftests/ftrace: Add test cases for entry args at function exit tracing/probes: Support $argN in return probe (kprobe and fprobe) tracing: Remove redundant #else block for BTF args from README tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init tracing/probes: Cleanup probe argument parser tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event x86/kprobes: Boost more instructions from grp2/3/4/5 x86/kprobes: Prohibit kprobing on INT and UD x86/kprobes: Refactor can_{probe,boost} return type to bool |
||
Linus Torvalds
|
9187210eee |
Networking changes for 6.9.
Core & protocols ---------------- - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc.) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code -------------------------------------------- - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter --------- - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF --- - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless -------- - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API ---------- - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc ---- - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers ------- - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmXv0mgACgkQMUZtbf5S IrtgMxAAuRd+WJW++SENr4KxIWhYO1q6Xcxnai43wrNkan9swD24icG8TYALt4f3 yoT6idQvWReAb5JNlh9rUQz8R7E0nJXlvEFn5MtJwcthx2C6wFo/XkJlddlRrT+j c2xGILwLjRhW65LaC0MZ2ECbEERkFz8xcGfK2SWzUgh6KYvPjcRfKFxugpM7xOQK P/Wnqhs4fVRS/Mj/bCcXcO+yhwC121Q3qVeQVjGS0AzEC65hAW87a/kc2BfgcegD EyI9R7mf6criQwX+0awubjfoIdr4oW/8oDVNvUDczkJkbaEVaLMQk9P5x/0XnnVS UHUchWXyI80Q8Rj12uN1/I0h3WtwNQnCRBuLSmtm6GLfCAwbLvp2nGWDnaXiqryW DVKUIHGvqPKjkOOMOVfSvfB3LvkS3xsFVVYiQBQCn0YSs/gtu4CoF2Nty9CiLPbK tTuxUnLdPDZDxU//l0VArZmP8p2JM7XQGJ+JH8GFH4SBTyBR23e0iyPSoyaxjnYn RReDnHMVsrS1i7GPhbqDJWn+uqMSs7N149i0XmmyeqwQHUVSJN3J2BApP2nCaDfy H2lTuYly5FfEezt61NvCE4qr/VsWeEjm1fYlFQ9dFn4pGn+HghyCpw+xD1ZN56DN lujemau5B3kk1UTtAT4ypPqvuqjkRFqpNV2LzsJSk/Js+hApw8Y= =oY52 -----END PGP SIGNATURE----- Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ... |
||
Steven Rostedt (Google)
|
2aa043a55b |
tracing/ring-buffer: Fix wait_on_pipe() race
When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.
CPU 0 CPU 1
----- -----
wait_index++;
index = wait_index;
ring_buffer_wake_waiters();
wait_on_pipe()
ring_buffer_wait();
The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:
prepare_to_wait();
if (!condition)
schedule();
Where the missing condition check is the iter->wait_index update.
Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.
In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.
Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.
Have the wait_on_pipe() condition callback also check the closed field.
Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.
Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950713@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes:
|
||
Steven Rostedt (Google)
|
7af9ded0c2 |
ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.
This does not change the ring_buffer_wait() prototype yet, but
restructures the code so that it can take a "cond" and "data" parameter
and will call wait_event_interruptible() with a helper function as the
condition.
The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either are true, it returns true.
If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and returns false.
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.399598519@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes:
|
||
Steven Rostedt (Google)
|
e36f19a645 |
ring-buffer: Reuse rb_watermark_hit() for the poll logic
The check for knowing if the poll should wait or not is basically the exact same logic as rb_watermark_hit(). The only difference is that rb_watermark_hit() also handles the !full case. But for the full case, the logic is the same. Just call that instead of duplicating the code in ring_buffer_poll_wait(). Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.802267543@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
8145f1c35f |
ring-buffer: Fix full_waiters_pending in poll
If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.
The poll code has:
rbwork->full_waiters_pending = true;
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.
But the code could get into a circular loop:
CPU 0 CPU 1
----- -----
[ Poll ]
[ shortest_full = 0 ]
rbwork->full_waiters_pending = true;
if (rbwork->full_waiters_pending &&
[ buffer percent ] > shortest_full) {
rbwork->wakeup_full = true;
[ queue_irqwork ]
cpu_buffer->shortest_full = full;
[ IRQ work ]
if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
[woken]
if ([ buffer percent ] > full)
break;
rbwork->full_waiters_pending = true;
if (rbwork->full_waiters_pending &&
[ buffer percent ] > shortest_full) {
rbwork->wakeup_full = true;
[ queue_irqwork ]
cpu_buffer->shortest_full = full;
[ IRQ work ]
if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
[woken]
[ Wash, rinse, repeat! ]
In the poll, the shortest_full needs to be set before the
full_pending_waiters, as once that is set, the writer will compare the
current shortest_full (which is incorrect) to decide to call the irq_work,
which will reset the shortest_full (expecting the readers to update it).
Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.
Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.630922155@goodmis.org
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
Steven Rostedt (Google)
|
761d9473e2 |
ring-buffer: Do not set shortest_full when full target is hit
The rb_watermark_hit() checks if the amount of data in the ring buffer is
above the percentage level passed in by the "full" variable. If it is, it
returns true.
But it also sets the "shortest_full" field of the cpu_buffer that informs
writers that it needs to call the irq_work if the amount of data on the
ring buffer is above the requested amount.
The rb_watermark_hit() always sets the shortest_full even if the amount in
the ring buffer is what it wants. As it is not going to wait, because it
has what it wants, there's no reason to set shortest_full.
Link: https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8ba08@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Linus Torvalds
|
685d982112 |
Core x86 changes for v6.9:
- The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code. - Propagate more RIP-relative addressing in assembly code, to generate slightly better code. - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options. - Rework the x86 idle code to cure RCU violations and to clean up the logic. - Clean up the vDSO Makefile logic. - Misc cleanups and fixes. [ Please note that there's a higher number of merge commits in this branch (three) than is usual in x86 topic trees. This happened due to the long testing lifecycle of the percpu changes that involved 3 merge windows, which generated a longer history and various interactions with other core x86 changes that we felt better about to carry in a single branch. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5 9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9 nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU 2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv F/mednG0gGc= =3v4F -----END PGP SIGNATURE----- Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ... |
||
Jakub Kicinski
|
5f20e6ab1f |
for-netdev
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmXvm7IACgkQ6rmadz2v bTqdMA//VMHNHVLb4oROoXyQD9fw2mCmIUEKzP88RXfqcxsfEX7HF+k8B5ZTk0ro CHXTAnc79+Qqg0j24bkQKxup/fKBQVw9D+Ia4b3ytlm1I2MtyU/16xNEzVhAPU2D iKk6mVBsEdCbt/GjpWORy/VVnZlZpC7BOpZLxsbbxgXOndnCegyjXzSnLGJGxdvi zkrQTn2SrFzLi6aNpVLqrv6Nks6HJusfCKsIrtlbkQ85dulasHOtwK9s6GF60nte aaho+MPx3L+lWEgapsm8rR779pHaYIB/GbZUgEPxE/xUJ/V8BzDgFNLMzEiIBRMN a0zZam11BkBzCfcO9gkvDRByaei/dZz2jdqfU4GlHklFj1WFfz8Q7fRLEPINksvj WXLgJADGY5mtGbjG21FScThxzj+Ruqwx0a13ddlyI/W+P3y5yzSWsLwJG5F9p0oU 6nlkJ4U8yg+9E1ie5ae0TibqvRJzXPjfOERZGwYDSVvfQGzv1z+DGSOPMmgNcWYM dIaO+A/+NS3zdbk8+1PP2SBbhHPk6kWyCUByWc7wMzCPTiwriFGY/DD2sN+Fsufo zorzfikUQOlTfzzD5jbmT49U8hUQUf6QIWsu7BijSiHaaC7am4S8QB2O6ibJMqdv yNiwvuX+ThgVIY3QKrLLqL0KPGeKMR5mtfq6rrwSpfp/b4g27FE= =eFgA -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2024-03-11 We've added 59 non-merge commits during the last 9 day(s) which contain a total of 88 files changed, 4181 insertions(+), 590 deletions(-). The main changes are: 1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena, from Alexei. 2) Introduce bpf_arena which is sparse shared memory region between bpf program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and bpf programs, from Alexei and Andrii. 3) Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it, from Alexei. 4) Use IETF format for field definitions in the BPF standard document, from Dave. 5) Extend struct_ops libbpf APIs to allow specify version suffixes for stuct_ops map types, share the same BPF program between several map definitions, and other improvements, from Eduard. 6) Enable struct_ops support for more than one page in trampolines, from Kui-Feng. 7) Support kCFI + BPF on riscv64, from Puranjay. 8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay. 9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke. ==================== Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Andrii Nakryiko
|
66c8473135 |
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
prog->aux->sleepable is checked very frequently as part of (some) BPF program run hot paths. So this extra aux indirection seems wasteful and on busy systems might cause unnecessary memory cache misses. Let's move sleepable flag into prog itself to eliminate unnecessary pointer dereference. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Message-ID: <20240309004739.2961431-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
fa4b851b4a |
Tracing fixes for v6.8-rc7:
- Do not allow large strings (> 4096) as single write to trace_marker The size of a string written into trace_marker was determined by the size of the sub-buffer in the ring buffer. That size is dependent on the PAGE_SIZE of the architecture as it can be mapped into user space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of the string of writing into trace_marker 64K. One of the selftests looks at the size of the ring buffer sub-buffers and writes that plus more into the trace_marker. The write will take what it can and report back what it consumed so that the user space application (like echo) will write the rest of the string. The string is stored in the ring buffer and can be read via the "trace" or "trace_pipe" files. The reading of the ring buffer uses vsnprintf(), which uses a precision "%.*s" to make sure it only reads what is stored in the buffer, as a bug could cause the string to be non terminated. With the combination of the precision change and the PAGE_SIZE of 64K allowing huge strings to be added into the ring buffer, plus the test that would actually stress that limit, a bug was reported that the precision used was too big for "%.*s" as the string was close to 64K in size and the max precision of vsnprintf is 32K. Linus suggested not to have that precision as it could hide a bug if the string was again stored without a nul byte. Another issue that was brought up is that the trace_seq buffer is also based on PAGE_SIZE even though it is not tied to the architecture limit like the ring buffer sub-buffer is. Having it be 64K * 2 is simply just too big and wasting memory on systems with 64K page sizes. It is now hardcoded to 8K which is what all other architectures with 4K PAGE_SIZE has. Finally, the write to trace_marker is now limited to 4K as there is no reason to write larger strings into trace_marker. - ring_buffer_wait() should not loop. The ring_buffer_wait() does not have the full context (yet) on if it should loop or not. Just exit the loop as soon as its woken up and let the callers decide to loop or not (they already do, so it's a bit redundant). - Fix shortest_full field to be the smallest amount in the ring buffer that a waiter is waiting for. The "shortest_full" field is updated when a new waiter comes in and wants to wait for a smaller amount of data in the ring buffer than other waiters. But after all waiters are woken up, it's not reset, so if another waiter comes in wanting to wait for more data, it will be woken up when the ring buffer has a smaller amount from what the previous waiters were waiting for. - The wake up all waiters on close is incorrectly called frome .release() and not from .flush() so it will never wake up any waiters as the .release() will not get called until all .read() calls are finished. And the wakeup is for the waiters in those .read() calls. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZe3j6xQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qmYOAQD6rPZ+ILqHmRQMZjsxaasBeVYidspY wj3fRGzwfiB6fgEAkIeA7FOrkOK0CuG8R+2AtQNF5ZjXdmfZdiYQD1/EjQU= =Hqlf -----END PGP SIGNATURE----- Merge tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Do not allow large strings (> 4096) as single write to trace_marker The size of a string written into trace_marker was determined by the size of the sub-buffer in the ring buffer. That size is dependent on the PAGE_SIZE of the architecture as it can be mapped into user space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of the string of writing into trace_marker 64K. One of the selftests looks at the size of the ring buffer sub-buffers and writes that plus more into the trace_marker. The write will take what it can and report back what it consumed so that the user space application (like echo) will write the rest of the string. The string is stored in the ring buffer and can be read via the "trace" or "trace_pipe" files. The reading of the ring buffer uses vsnprintf(), which uses a precision "%.*s" to make sure it only reads what is stored in the buffer, as a bug could cause the string to be non terminated. With the combination of the precision change and the PAGE_SIZE of 64K allowing huge strings to be added into the ring buffer, plus the test that would actually stress that limit, a bug was reported that the precision used was too big for "%.*s" as the string was close to 64K in size and the max precision of vsnprintf is 32K. Linus suggested not to have that precision as it could hide a bug if the string was again stored without a nul byte. Another issue that was brought up is that the trace_seq buffer is also based on PAGE_SIZE even though it is not tied to the architecture limit like the ring buffer sub-buffer is. Having it be 64K * 2 is simply just too big and wasting memory on systems with 64K page sizes. It is now hardcoded to 8K which is what all other architectures with 4K PAGE_SIZE has. Finally, the write to trace_marker is now limited to 4K as there is no reason to write larger strings into trace_marker. - ring_buffer_wait() should not loop. The ring_buffer_wait() does not have the full context (yet) on if it should loop or not. Just exit the loop as soon as its woken up and let the callers decide to loop or not (they already do, so it's a bit redundant). - Fix shortest_full field to be the smallest amount in the ring buffer that a waiter is waiting for. The "shortest_full" field is updated when a new waiter comes in and wants to wait for a smaller amount of data in the ring buffer than other waiters. But after all waiters are woken up, it's not reset, so if another waiter comes in wanting to wait for more data, it will be woken up when the ring buffer has a smaller amount from what the previous waiters were waiting for. - The wake up all waiters on close is incorrectly called frome .release() and not from .flush() so it will never wake up any waiters as the .release() will not get called until all .read() calls are finished. And the wakeup is for the waiters in those .read() calls. * tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Use .flush() call to wake up readers ring-buffer: Fix resetting of shortest_full ring-buffer: Fix waking up ring buffer readers tracing: Limit trace_marker writes to just 4K tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE tracing: Remove precision vsnprintf() check from print event |
||
Steven Rostedt (Google)
|
e5d7c19165 |
tracing: Use .flush() call to wake up readers
The .release() function does not get called until all readers of a file
descriptor are finished.
If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.
The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would wake
up the other tasks because they are blocked on another descriptor then the
one that was closed(). But there's other wake ups that solve that issue.
When a thread is blocked on a read, it can still hang even when another
thread closed its descriptor.
This is what the .flush() callback is for. Have the .flush() wake up the
readers.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes:
|
||
Steven Rostedt (Google)
|
68282dd930 |
ring-buffer: Fix resetting of shortest_full
The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a tasks waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.
As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.
The problem is that the smallest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace).
This means that tasks may be woken up more often then when they want to
be. Instead, have the shortest_full field get reset just before waking up
all the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.
Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes:
|
||
Steven Rostedt (Google)
|
b359457368 |
ring-buffer: Fix waking up ring buffer readers
A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for and if the ring buffer is past that, it will wake up all the
waiters.
The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.
If a file that uses a ring buffer closes, and there's threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used.
Before entering the wait loop, the waiter will read the wait_index. On
wakeup, it will check if the wait_index is different than when it entered
the loop, and will exit the loop if it is. The waker will only need to
update the wait_index before waking up the waiters.
This had a couple of bugs. One trivial one and one broken by design.
The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule() which it was not.
The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
the ring_buffer_wait() doesn't have enough context to know if it should
break out of the loop.
The loop itself is not needed, because all the callers to the
ring_buffer_wait() also has their own loop, as the callers have a better
sense of what the context is to decide whether to break out of the loop
or not.
Just have the ring_buffer_wait() block once, and if it gets woken up, exit
the function and let the callers decide what to do next.
Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes:
|
||
Jakub Kicinski
|
e3afe5dd3a |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c |
||
Steven Rostedt (Google)
|
095fe48912 |
tracing: Limit trace_marker writes to just 4K
Limit the max print event of trace_marker to just 4K string size. This must also be less than the amount that can be held by a trace_seq along with the text that is before the output (like the task name, PID, CPU, state, etc). As trace_seq is made to handle large events (some greater than 4K). Make the max size of a trace_marker write event be 4K which is guaranteed to fit in the trace_seq buffer. Link: https://lore.kernel.org/linux-trace-kernel/20240304223433.4ba47dff@gandalf.local.home Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
5efd3e2aef |
tracing: Remove precision vsnprintf() check from print event
This reverts |
||
Masami Hiramatsu (Google)
|
25f00e40ce |
tracing/probes: Support $argN in return probe (kprobe and fprobe)
Support accessing $argN in the return probe events. This will help users to record entry data in function return (exit) event for simplfing the function entry/exit information in one event, and record the result values (e.g. allocated object/initialized object) at function exit. For example, if we have a function `int init_foo(struct foo *obj, int param)` sometimes we want to check how `obj` is initialized. In such case, we can define a new return event like below; # echo 'r init_foo retval=$retval param=$arg2 field1=+0($arg1)' >> kprobe_events Thus it records the function parameter `param` and its result `obj->field1` (the dereference will be done in the function exit timing) value at once. This also support fprobe, BTF args and'$arg*'. So if CONFIG_DEBUG_INFO_BTF is enabled, we can trace both function parameters and the return value by following command. # echo 'f target_function%return $arg* $retval' >> dynamic_events Link: https://lore.kernel.org/all/170952365552.229804.224112990211602895.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Masami Hiramatsu (Google)
|
c18f9eabee |
tracing: Remove redundant #else block for BTF args from README
Remove redundant #else block for BTF args from README message. This is a cleanup, so no change on the message. Link: https://lore.kernel.org/all/170952364558.229804.17285528811097152410.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
035ba76014 |
tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init
Instead of incrementing the trace_probe::nr_args, init it at trace_probe_init(). Without this change, there is no way to get the number of trace_probe arguments while parsing it. This is a cleanup, so the behavior is not changed. Link: https://lore.kernel.org/all/170952363585.229804.13060759900346411951.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Masami Hiramatsu (Google)
|
032330abd0 |
tracing/probes: Cleanup probe argument parser
Cleanup traceprobe_parse_probe_arg_body() to split out the type parser and post-processing part of fetch_insn. This makes no functional change. Link: https://lore.kernel.org/all/170952362603.229804.9942703761682605372.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
7e37b6bc3c |
tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event
Despite the fprobe event, "Kretprobe" was commented. So fix it. Link: https://lore.kernel.org/all/170952361630.229804.10832200172327797860.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Jakub Kicinski
|
4b2765ae41 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZeEKVAAKCRDbK58LschI g7oYAQD5Jlv4fIVTvxvfZrTTZ2tU+OsPa75mc8SDKwpash3YygEA8kvESy8+t6pg D6QmSf1DIZdFoSp/bV+pfkNWMeR8gwg= =mTAj -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-02-29 We've added 119 non-merge commits during the last 32 day(s) which contain a total of 150 files changed, 3589 insertions(+), 995 deletions(-). The main changes are: 1) Extend the BPF verifier to enable static subprog calls in spin lock critical sections, from Kumar Kartikeya Dwivedi. 2) Fix confusing and incorrect inference of PTR_TO_CTX argument type in BPF global subprogs, from Andrii Nakryiko. 3) Larger batch of riscv BPF JIT improvements and enabling inlining of the bpf_kptr_xchg() for RV64, from Pu Lehui. 4) Allow skeleton users to change the values of the fields in struct_ops maps at runtime, from Kui-Feng Lee. 5) Extend the verifier's capabilities of tracking scalars when they are spilled to stack, especially when the spill or fill is narrowing, from Maxim Mikityanskiy & Eduard Zingerman. 6) Various BPF selftest improvements to fix errors under gcc BPF backend, from Jose E. Marchesi. 7) Avoid module loading failure when the module trying to register a struct_ops has its BTF section stripped, from Geliang Tang. 8) Annotate all kfuncs in .BTF_ids section which eventually allows for automatic kfunc prototype generation from bpftool, from Daniel Xu. 9) Several updates to the instruction-set.rst IETF standardization document, from Dave Thaler. 10) Shrink the size of struct bpf_map resp. bpf_array, from Alexei Starovoitov. 11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer, from Benjamin Tissoires. 12) Fix bpftool to be more portable to musl libc by using POSIX's basename(), from Arnaldo Carvalho de Melo. 13) Add libbpf support to gcc in CORE macro definitions, from Cupertino Miranda. 14) Remove a duplicate type check in perf_event_bpf_event, from Florian Lehner. 15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them with notrace correctly, from Yonghong Song. 16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible array to fix build warnings, from Kees Cook. 17) Fix resolve_btfids cross-compilation to non host-native endianness, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) selftests/bpf: Test if shadow types work correctly. bpftool: Add an example for struct_ops map and shadow type. bpftool: Generated shadow variables for struct_ops maps. libbpf: Convert st_ops->data to shadow type. libbpf: Set btf_value_type_id of struct bpf_map for struct_ops. bpf: Replace bpf_lpm_trie_key 0-length array with flexible array bpf, arm64: use bpf_prog_pack for memory management arm64: patching: implement text_poke API bpf, arm64: support exceptions arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT bpf: add is_async_callback_calling_insn() helper bpf: introduce in_sleepable() helper bpf: allow more maps in sleepable bpf programs selftests/bpf: Test case for lacking CFI stub functions. bpf: Check cfi_stubs before registering a struct_ops type. bpf: Clarify batch lookup/lookup_and_delete semantics bpf, docs: specify which BPF_ABS and BPF_IND fields were zero bpf, docs: Fix typos in instruction-set.rst selftests/bpf: update tcp_custom_syncookie to use scalar packet offset bpf: Shrink size of struct bpf_map/bpf_array. ... ==================== Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Linus Torvalds
|
161671a6eb |
Probes fixes for v6.8-rc5:
- fprobe: Fix to allocate entry_data_size buffer for each rethook instance. This fixes a buffer overrun bug (which leads a kernel crash) when fprobe user uses its entry_data in the entry_handler. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXhIPgbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bhwIH/1h5q2ZqNwNplDGVQpWU G1uuRHLlt47jwbGR3gfeYqVELtX0gFigBsmVouCKK3u3qerB1pDscYhULzKeHjS4 1HAsonj+vKY2pbdCaRnxRT7ejlEioN8CwPb1eqY6Bf6XQ2tJqS5gUqdej8JDJuY5 tpNAhHWqAnRvf1V5muclGAIU+9zavrAjbetpgrPEDIjE5idFvN+6D+4PXiM1cRIW KXW1oA7VlShdfY7xprSZ33Lx7C/dLWojM2P/z/BvqyXOf4f1ovqtGFJegW4n7V5b ZgamgOcSBwFLTVOTpOzn0peucduLFTfEWyC7fFGkHjBxTl2JypsQLEupdoaWLvBB el4= =bUgZ -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull fprobe fix from Masami Hiramatsu: - allocate entry_data_size buffer for each rethook instance. This fixes a buffer overrun bug (which leads a kernel crash) when fprobe user uses its entry_data in the entry_handler. * tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix to allocate entry_data_size buffer with rethook instances |
||
Masami Hiramatsu (Google)
|
6572786006 |
fprobe: Fix to allocate entry_data_size buffer with rethook instances
Fix to allocate fprobe::entry_data_size buffer with rethook instances.
If fprobe doesn't allocate entry_data_size buffer for each rethook instance,
fprobe entry handler can cause a buffer overrun when storing entry data in
entry handler.
Link: https://lore.kernel.org/all/170920576727.107552.638161246679734051.stgit@devnote2/
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Closes: https://lore.kernel.org/all/Zd9eBn2FTQzYyg7L@krava/
Fixes:
|
||
Jakub Kicinski
|
fecc51559a |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c |
||
Steven Rostedt (Google)
|
e78fb4eac8 |
ring-buffer: Do not let subbuf be bigger than write mask
The data on the subbuffer is measured by a write variable that also
contains status flags. The counter is just 20 bits in length. If the
subbuffer is bigger than then counter, it will fail.
Make sure that the subbuffer can not be set to greater than the counter
that keeps track of the data on the subbuffer.
Link: https://lore.kernel.org/linux-trace-kernel/20240220095112.77e9cb81@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Linus Torvalds
|
ad645dea35 |
Probes fixes for v6.8-rc4:
- tracing/probes: Fix BTF structure member finder to find the members which are placed after any anonymous union member correctly. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXQqH8bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bKCsH/3EWGLBbwNMFFje+t2Bp L6LgeyG7ZUzeDb4ioO2bae++e3m7xtR7YAWbZoX193/BfPbBpr9rkUCfYqAaImBk dzdI+E8/gOkliLP0Vvj5GIecKDUdlB053NRgoOcq0n1NYOO3/4kyPZ+53tX5gsBI YNcdIKDdEVWX0+vinK6cCjWOLdVJc/8pH/6G2KNgG6Qs2T1dV0YzW5QRv1hhUOr3 1Obog64AFt7tVtIkGOK19gRJKa6VpysdD8wpekJnKfquTtFffrsgf4yTg84zf7Fe ueaOtJ7pV4R0SbQ/n5SnXe5cxXjX7MD24FZaeHCB7mp8sOsdeTukFmSTsIgOFeD/ cBQ= =KY+B -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - tracing/probes: Fix BTF structure member finder to find the members which are placed after any anonymous union member correctly. * tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix to search structure fields correctly |
||
Masami Hiramatsu (Google)
|
9704669c38 |
tracing/probes: Fix to search structure fields correctly
Fix to search a field from the structure which has anonymous union
correctly.
Since the reference `type` pointer was updated in the loop, the search
loop suddenly aborted where it hits an anonymous union. Thus it can not
find the field after the anonymous union. This avoids updating the
cursor `type` pointer in the loop.
Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/
Fixes:
|
||
Linus Torvalds
|
4b6f7c624e |
Tracing fixes for v6.8:
- Fix the #ifndef that didn't have CONFIG_ on HAVE_DYNAMIC_FTRACE_WITH_REGS The fix to have dynamic trampolines work with x86 broke arm64 as the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the previous fix was to fix. - Fix tracing_on state The code to test if "tracing_on" is set used ring_buffer_record_is_on() which returns false if the ring buffer isn't able to be written to. But the ring buffer disable has several bits that disable it. One is internal disabling which is used for resizing and other modifications of the ring buffer. But the "tracing_on" user space visible flag should only report if tracing is actually on and not internally disabled, as this can cause confusion as writing "1" when it is disabled will not enable it. Instead use ring_buffer_record_is_set_on() which shows the user space visible settings. - Fix a false positive kmemleak on saved cmdlines Now that the saved_cmdlines structure is allocated via alloc_page() and not via kmalloc() it has become invisible to kmemleak. The allocation done to one of its pointers was flagged as a dangling allocation leak. Make kmemleak aware of this allocation and free. - Fix synthetic event dynamic strings. A update that cleaned up the synthetic event code removed the return value of trace_string(), and had it return zero instead of the length, causing dynamic strings in the synthetic event to always have zero size. - Clean up documentation and header files for seq_buf -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZc92CxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qiD6AQCCtF9hBWq7slLlBQ+k07hWXOi1h4nR Ae6UmoGlu3u4ugEA6/g8mO2vjABagnK7RSk/R+s1SvSGqWkmAsWeEKirEAY= =Eqjs -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the #ifndef that didn't have the 'CONFIG_' prefix on HAVE_DYNAMIC_FTRACE_WITH_REGS The fix to have dynamic trampolines work with x86 broke arm64 as the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the previous fix was to fix. - Fix tracing_on state The code to test if "tracing_on" is set incorrectly used ring_buffer_record_is_on() which returns false if the ring buffer isn't able to be written to. But the ring buffer disable has several bits that disable it. One is internal disabling which is used for resizing and other modifications of the ring buffer. But the "tracing_on" user space visible flag should only report if tracing is actually on and not internally disabled, as this can cause confusion as writing "1" when it is disabled will not enable it. Instead use ring_buffer_record_is_set_on() which shows the user space visible settings. - Fix a false positive kmemleak on saved cmdlines Now that the saved_cmdlines structure is allocated via alloc_page() and not via kmalloc() it has become invisible to kmemleak. The allocation done to one of its pointers was flagged as a dangling allocation leak. Make kmemleak aware of this allocation and free. - Fix synthetic event dynamic strings An update that cleaned up the synthetic event code removed the return value of trace_string(), and had it return zero instead of the length, causing dynamic strings in the synthetic event to always have zero size. - Clean up documentation and header files for seq_buf * tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: seq_buf: Fix kernel documentation seq_buf: Don't use "proxy" headers tracing/synthetic: Fix trace_string() return value tracing: Inform kmemleak of saved_cmdlines allocation tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on() tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef |
||
Jakub Kicinski
|
73be9a3aab |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c |
||
Thorsten Blum
|
9b6326354c |
tracing/synthetic: Fix trace_string() return value
Fix trace_string() by assigning the string length to the return variable which got lost in commit |
||
Steven Rostedt (Google)
|
2394ac4145 |
tracing: Inform kmemleak of saved_cmdlines allocation
The allocation of the struct saved_cmdlines_buffer structure changed from:
s = kmalloc(sizeof(*s), GFP_KERNEL);
s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
to:
orig_size = sizeof(*s) + val * TASK_COMM_LEN;
order = get_order(orig_size);
size = 1 << (order + PAGE_SHIFT);
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return NULL;
s = page_address(page);
memset(s, 0, sizeof(*s));
s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
Where that s->saved_cmdlines allocation looks to be a dangling allocation
to kmemleak. That's because kmemleak only keeps track of kmalloc()
allocations. For allocations that use page_alloc() directly, the kmemleak
needs to be explicitly informed about it.
Add kmemleak_alloc() and kmemleak_free() around the page allocation so
that it doesn't give the following false positive:
unreferenced object 0xffff8881010c8000 (size 32760):
comm "swapper", pid 0, jiffies 4294667296
hex dump (first 32 bytes):
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
backtrace (crc ae6ec1b9):
[<ffffffff86722405>] kmemleak_alloc+0x45/0x80
[<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190
[<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0
[<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230
[<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460
[<ffffffff8864a174>] early_trace_init+0x14/0xa0
[<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0
[<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30
[<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80
[<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b
Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Fixes:
|
||
Ingo Molnar
|
4589f199eb |
Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Sven Schnelle
|
a6eaa24f1c |
tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load. Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-By: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Petr Pavlu
|
bdbddb109c |
tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
Commit |
||
Linus Torvalds
|
ca8a66738a |
Tracing fixes for v6.8-rc3:
- Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there's other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZcYi8RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqENAQD6xGE9EPkbHArElKfgpSuQOfGhcyyP LjgVhqVgmIoqUwD8CeVpxk3VwZIOQYvPn5XictcZgkYSeEWUZcKYg4c/3gs= =iIBv -----END PGP SIGNATURE----- Merge tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there are other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. * tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix wasted memory in saved_cmdlines logic ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default |
||
Steven Rostedt (Google)
|
44dc5c41b5 |
tracing: Fix wasted memory in saved_cmdlines logic
While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.
The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.
The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.
In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.
Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.
This is similar to a recommendation that Linus had made about eventfs_inode names:
https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/
Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN
It will use this extra space for the saved_cmdline portion.
Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.
Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes:
|
||
Masami Hiramatsu (Google)
|
a8b9cf62ad |
ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
The commit |
||
Jakub Kicinski
|
3be042cf46 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/common.h |
||
Masami Hiramatsu (Google)
|
9a571c1e27 |
tracing/probes: Fix to set arg size and fmt after setting type from BTF
Since the BTF type setting updates probe_arg::type, the type size
calculation and setting print-fmt should be done after that.
Without this fix, the argument size and print-fmt can be wrong.
Link: https://lore.kernel.org/all/170602218196.215583.6417859469540955777.stgit@devnote2/
Fixes:
|
||
Masami Hiramatsu (Google)
|
8c427cc2fa |
tracing/probes: Fix to show a parse error for bad type for $comm
Fix to show a parse error for bad type (non-string) for $comm/$COMM and
immediate-string. With this fix, error_log file shows appropriate error
message as below.
/sys/kernel/tracing # echo 'p vfs_read $comm:u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # echo 'p vfs_read \"hoge":u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # cat error_log
[ 30.144183] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read $comm:u32
^
[ 62.618500] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read \"hoge":u32
^
Link: https://lore.kernel.org/all/170602215411.215583.2238016352271091852.stgit@devnote2/
Fixes:
|
||
Jakub Kicinski
|
cf244463a2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Daniel Bristot de Oliveira
|
1389358bb0 |
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
Currently, the timerlat's hrtimer is initialized at the first read of
timerlat_fd, and destroyed at close(). It works, but it causes an error
if the user program open() and close() the file without reading.
Here's an example:
# echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
# echo timerlat > /sys/kernel/debug/tracing/current_tracer
# cat <<EOF > ./timerlat_load.py
# !/usr/bin/env python3
timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
timerlat_fd.close();
EOF
# ./taskset -c 0 ./timerlat_load.py
<BOOM>
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:hrtimer_active+0xd/0x50
Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
FS: 00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x23/0x70
? page_fault_oops+0x171/0x4e0
? srso_alias_return_thunk+0x5/0x7f
? avc_has_extended_perms+0x237/0x520
? exc_page_fault+0x7f/0x180
? asm_exc_page_fault+0x26/0x30
? hrtimer_active+0xd/0x50
hrtimer_cancel+0x15/0x40
timerlat_fd_release+0x48/0xe0
__fput+0xf5/0x290
__x64_sys_close+0x3d/0x80
do_syscall_64+0x60/0x90
? srso_alias_return_thunk+0x5/0x7f
? __x64_sys_ioctl+0x72/0xd0
? srso_alias_return_thunk+0x5/0x7f
? syscall_exit_to_user_mode+0x2b/0x40
? srso_alias_return_thunk+0x5/0x7f
? do_syscall_64+0x6c/0x90
? srso_alias_return_thunk+0x5/0x7f
? exit_to_user_mode_prepare+0x142/0x1f0
? srso_alias_return_thunk+0x5/0x7f
? syscall_exit_to_user_mode+0x2b/0x40
? srso_alias_return_thunk+0x5/0x7f
? do_syscall_64+0x6c/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f2ffb321594
Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
</TASK>
CR2: 0000000000000010
---[ end trace 0000000000000000 ]---
Move hrtimer_init to timerlat_fd open() to avoid this problem.
Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
Daniel Xu
|
6f3189f38a |
bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Vincent Donnefort
|
66bbea9ed6 |
ring-buffer: Clean ring_buffer_poll_wait() error return
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.
Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.
Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes:
|
||
Jakub Kicinski
|
92046e83c0 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9 olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc= =wVMl -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-01-26 We've added 107 non-merge commits during the last 4 day(s) which contain a total of 101 files changed, 6009 insertions(+), 1260 deletions(-). The main changes are: 1) Add BPF token support to delegate a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. With addressed changes from Christian and Linus' reviews, from Andrii Nakryiko. 2) Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type, from Kui-Feng Lee. 3) Add support for retrieval of cookies for perf/kprobe multi links, from Jiri Olsa. 4) Bigger batch of prep-work for the BPF verifier to eventually support preserving boundaries and tracking scalars on narrowing fills, from Maxim Mikityanskiy. 5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help with the scenario of SYN floods, from Kuniyuki Iwashima. 6) Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects, from Hou Tao. 7) Extend BPF verifier to track aligned ST stores as imprecise spilled registers, from Yonghong Song. 8) Several fixes to BPF selftests around inline asm constraints and unsupported VLA code generation, from Jose E. Marchesi. 9) Various updates to the BPF IETF instruction set draft document such as the introduction of conformance groups for instructions, from Dave Thaler. 10) Fix BPF verifier to make infinite loop detection in is_state_visited() exact to catch some too lax spill/fill corner cases, from Eduard Zingerman. 11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly instead of implicitly for various register types, from Hao Sun. 12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness in neighbor advertisement at setup time, from Martin KaFai Lau. 13) Change BPF selftests to skip callback tests for the case when the JIT is disabled, from Tiezhu Yang. 14) Add a small extension to libbpf which allows to auto create a map-in-map's inner map, from Andrey Grafin. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits) selftests/bpf: Add missing line break in test_verifier bpf, docs: Clarify definitions of various instructions bpf: Fix error checks against bpf_get_btf_vmlinux(). bpf: One more maintainer for libbpf and BPF selftests selftests/bpf: Incorporate LSM policy to token-based tests selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar selftests/bpf: Add tests for BPF object load with implicit token selftests/bpf: Add BPF object loading tests with explicit token passing libbpf: Wire up BPF token support at BPF object level libbpf: Wire up token_fd into feature probing logic libbpf: Move feature detection code into its own file libbpf: Further decouple feature checking logic from bpf_object libbpf: Split feature detectors definitions from cached results selftests/bpf: Utilize string values for delegate_xxx mount options bpf: Support symbolic BPF FS delegation mount options bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS bpf,selinux: Allocate bpf_security_struct per BPF token selftests/bpf: Add BPF token-enabled tests libbpf: Add BPF token support to bpf_prog_load() API ... ==================== Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Masami Hiramatsu (Google)
|
0958b33ef5 |
tracing/trigger: Fix to return error if failed to alloc snapshot
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.
Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2
Fixes:
|
||
Andrii Nakryiko
|
bbc1d24724 |
bpf: Take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and perfmon_capable() calls inside bpf_base_func_proto() function (and other similar ones) to determine eligibility of a given BPF helper for a given program, use previously recorded BPF token during BPF_PROG_LOAD command handling to inform the decision. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org |
||
Jiri Olsa
|
9fd112b1f8 |
bpf: Store cookies in kprobe_multi bpf_link_info data
Storing cookies in kprobe_multi bpf_link_info data. The cookies field is optional and if provided it needs to be an array of __u64 with kprobe_multi.count length. Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240119110505.400573-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Petr Pavlu
|
2b44760609 |
tracing: Ensure visibility when inserting an element into tracing_map
Running the following two commands in parallel on a multi-processor AArch64 machine can sporadically produce an unexpected warning about duplicate histogram entries: $ while true; do echo hist:key=id.syscall:val=hitcount > \ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist sleep 0.001 done $ stress-ng --sysbadaddr $(nproc) The warning looks as follows: [ 2911.172474] ------------[ cut here ]------------ [ 2911.173111] Duplicates detected: 1 [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408 [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E) [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1 [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01 [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018 [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408 [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408 [ 2911.185310] sp : ffff8000a1513900 [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001 [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008 [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180 [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8 [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731 [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8 [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000 [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480 [ 2911.194259] Call trace: [ 2911.194626] tracing_map_sort_entries+0x3e0/0x408 [ 2911.195220] hist_show+0x124/0x800 [ 2911.195692] seq_read_iter+0x1d4/0x4e8 [ 2911.196193] seq_read+0xe8/0x138 [ 2911.196638] vfs_read+0xc8/0x300 [ 2911.197078] ksys_read+0x70/0x108 [ 2911.197534] __arm64_sys_read+0x24/0x38 [ 2911.198046] invoke_syscall+0x78/0x108 [ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8 [ 2911.199157] do_el0_svc+0x28/0x40 [ 2911.199613] el0_svc+0x40/0x178 [ 2911.200048] el0t_64_sync_handler+0x13c/0x158 [ 2911.200621] el0t_64_sync+0x1a8/0x1b0 [ 2911.201115] ---[ end trace 0000000000000000 ]--- The problem appears to be caused by CPU reordering of writes issued from __tracing_map_insert(). The check for the presence of an element with a given key in this function is: val = READ_ONCE(entry->val); if (val && keys_match(key, val->key, map->key_size)) ... The write of a new entry is: elt = get_free_elt(map); memcpy(elt->key, key, map->key_size); entry->val = elt; The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;" stores may become visible in the reversed order on another CPU. This second CPU might then incorrectly determine that a new key doesn't match an already present val->key and subsequently insert a new element, resulting in a duplicate. Fix the problem by adding a write barrier between "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for good measure, also use WRITE_ONCE(entry->val, elt) for publishing the element. The sequence pairs with the mentioned "READ_ONCE(entry->val);" and the "val->key" check which has an address dependency. The barrier is placed on a path executed when adding an element for a new key. Subsequent updates targeting the same key remain unaffected. From the user's perspective, the issue was introduced by commit |
||
Linus Torvalds
|
a2ded784cd |
tracing updates for 6.8:
- Allow kernel trace instance creation to specify what events are created Inside the kernel, a subsystem may create a tracing instance that it can use to send events to user space. This sub-system may not care about the thousands of events that exist in eventfs. Allow the sub-system to specify what sub-systems of events it cares about, and only those events are exposed to this instance. - Allow the ring buffer to be broken up into bigger sub-buffers than just the architecture page size. A new tracefs file called "buffer_subbuf_size_kb" is created. The user can now specify a minimum size the sub-buffer may be in kilobytes. Note, that the implementation currently make the sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in kilobyte size, and the sub-buffer will be updated to the next size that it will can accommodate it. If the user writes in 10, it will change the size to be 4 pages on x86 (16K), as that is the next available size that can hold 10K pages. - Update the debug output when a corrupt time is detected in the ring buffer. If the ring buffer detects inconsistent timestamps, there's a debug config options that will dump the contents of the meta data of the sub-buffer that is used for debugging. Add some more information to this dump that helps with debugging. - Add more timestamp debugging checks (only triggers when the config is enabled) - Increase the trace_seq iterator to 2 page sizes. - Allow strings written into tracefs_marker to be larger. Up to just under 2 page sizes (based on what trace_seq can hold). - Increase the trace_maker_raw write to be as big as a sub-buffer can hold. - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been removed. - More selftests were added. - Some code clean ups as well. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZ8p3BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6ql2GAQDZg/zlFEiJHyTfWbCIE8pA3T5xbzKo 26TNxIZAxJJZpQEAvGFU5Smy14pG6soEoVMp8B6ZOANbqU8VVamhOL+r+Qw= =0OYG -----END PGP SIGNATURE----- Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Allow kernel trace instance creation to specify what events are created Inside the kernel, a subsystem may create a tracing instance that it can use to send events to user space. This sub-system may not care about the thousands of events that exist in eventfs. Allow the sub-system to specify what sub-systems of events it cares about, and only those events are exposed to this instance. - Allow the ring buffer to be broken up into bigger sub-buffers than just the architecture page size. A new tracefs file called "buffer_subbuf_size_kb" is created. The user can now specify a minimum size the sub-buffer may be in kilobytes. Note, that the implementation currently make the sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in kilobyte size, and the sub-buffer will be updated to the next size that it will can accommodate it. If the user writes in 10, it will change the size to be 4 pages on x86 (16K), as that is the next available size that can hold 10K pages. - Update the debug output when a corrupt time is detected in the ring buffer. If the ring buffer detects inconsistent timestamps, there's a debug config options that will dump the contents of the meta data of the sub-buffer that is used for debugging. Add some more information to this dump that helps with debugging. - Add more timestamp debugging checks (only triggers when the config is enabled) - Increase the trace_seq iterator to 2 page sizes. - Allow strings written into tracefs_marker to be larger. Up to just under 2 page sizes (based on what trace_seq can hold). - Increase the trace_maker_raw write to be as big as a sub-buffer can hold. - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been removed. - More selftests were added. - Some code clean ups as well. * tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) ring-buffer: Remove stale comment from ring_buffer_size() tracing histograms: Simplify parse_actions() function tracing/selftests: Remove exec permissions from trace_marker.tc test ring-buffer: Use subbuf_order for buffer page masking tracing: Update subbuffer with kilobytes not page order ringbuffer/selftest: Add basic selftest to test changing subbuf order ring-buffer: Add documentation on the buffer_subbuf_order file ring-buffer: Just update the subbuffers when changing their allocation order ring-buffer: Keep the same size when updating the order tracing: Stop the tracing while changing the ring buffer subbuf size tracing: Update snapshot order along with main buffer order ring-buffer: Make sure the spare sub buffer used for reads has same size ring-buffer: Do no swap cpu buffers if order is different ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure ring-buffer: Read and write to ring buffers with custom sub buffer size ring-buffer: Set new size of the ring buffer sub page ring-buffer: Add interface for configuring trace sub buffer size ring-buffer: Page size per ring buffer ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter ring-buffer: Check if absolute timestamp goes backwards ... |
||
Linus Torvalds
|
5b890ad456 |
Probes update for v6.8:
- Kprobes trace event to show the actual function name in notrace-symbol warning. Instead of using user specified symbol name, use "%ps" printk format to show the actual symbol at the probe address. Since kprobe event accepts the offset from symbol which is bigger than the symbol size, user specified symbol may not be the actual probed symbol. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmWdZB0bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b9AcH/R8mNbgAbKlxSXUm0NAG xrUcN9vyb9yaLgvoIEvW+XF6EMaCM6G2kG+wSaJB6xFiPlJgf9FhILjDjHAtV2x1 wXL8r3eLyKvkU3HXfS7RphUTPecgblI16FHZ12x2TkQ41KoRzQf2c7cSQs4B8SHP W5LPqvxxqjbV84iqZPScez99S0ZS0Of3ubmepVEm2LDshfhUVMIUH1vfvEn3vQI7 k5PoNiVRem+rjduERM3I7Zd51K7Lz/5hN56q6ok2vY8hVoRdp0j83Ly36h21ClS9 CtvlzPX0YjaogVd8Gyc3z+vqy61YiNA1q0fRqIhagmfIy/26s1ORaq/0S2gywxXn piA= =mfU0 -----END PGP SIGNATURE----- Merge tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes update from Masami Hiramatsu: - Update the Kprobes trace event to show the actual function name in notrace-symbol warning. Instead of using the user specified symbol name, use "%ps" printk format to show the actual symbol at the probe address. Since kprobe event accepts the offset from symbol which is bigger than the symbol size, the user specified symbol may not be the actual probed symbol. * tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: trace/kprobe: Display the actual notrace function when rejecting a probe |
||
Linus Torvalds
|
3e7aeb78ab |
Networking changes for 6.8.
Core & protocols ---------------- - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes. This improves TCP performances with many concurrent connections up to 40%. - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks. - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination. - Refactor the TCP bind conflict code, shrinking related socket structs. - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF. - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it. - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations. - Reduce extension header parsing overhead at GRO time. - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries. - Reorder nftables struct members, to keep data accessed by the datapath first. - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer. - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex). - More data-race annotations. - Extend the diag interface to dump TCP bound-only sockets. - Conditional notification of events for TC qdisc class and actions. - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID. - Implement SMCv2.1 virtual ISM device support. - Add support for Batman-avd mulicast packet type. BPF --- - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload. - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques. - Support uid/gid options when mounting bpffs. - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id. - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext. - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter. - Support for VLAN tag in XDP hints. - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter). Misc ---- - Support for parellel TC self-tests execution. - Increase MPTCP self-tests coverage. - Updated the bridge documentation, including several so-far undocumented features. - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs. - Add TCP-AO self-tests. - Add kunit tests for both cfg80211 and mac80211. - Autogenerate Netlink families documentation from YAML spec. - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs. - A bunch of additional module descriptions fixes. - Catch incorrect freeing of pages belonging to a page pool. Driver API ---------- - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust. - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship. - Introduce notifications filtering for devlink to allow control application scale to thousands of instances. - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host. - Add support for ethtool symmetric-xor RSS hash. - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform. - Expose pin fractional frequency offset value over new DPLL generic netlink attribute. - Convert older drivers to platform remove callback returning void. - Add support for PHY package MMD read/write. New hardware / drivers ---------------------- - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed ------- - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF ucqPEcZB5cRE =1bW6 -----END PGP SIGNATURE----- Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most interesting thing is probably the networking structs reorganization and a significant amount of changes is around self-tests. Core & protocols: - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes This improves TCP performances with many concurrent connections up to 40% - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination - Refactor the TCP bind conflict code, shrinking related socket structs - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations - Reduce extension header parsing overhead at GRO time - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries - Reorder nftables struct members, to keep data accessed by the datapath first - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex) - More data-race annotations - Extend the diag interface to dump TCP bound-only sockets - Conditional notification of events for TC qdisc class and actions - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID - Implement SMCv2.1 virtual ISM device support - Add support for Batman-avd mulicast packet type BPF: - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. This improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques - Support uid/gid options when mounting bpffs - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter - Support for VLAN tag in XDP hints - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter) Misc: - Support for parellel TC self-tests execution - Increase MPTCP self-tests coverage - Updated the bridge documentation, including several so-far undocumented features - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs - Add TCP-AO self-tests - Add kunit tests for both cfg80211 and mac80211 - Autogenerate Netlink families documentation from YAML spec - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs - A bunch of additional module descriptions fixes - Catch incorrect freeing of pages belonging to a page pool Driver API: - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship - Introduce notifications filtering for devlink to allow control application scale to thousands of instances - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host - Add support for ethtool symmetric-xor RSS hash - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform - Expose pin fractional frequency offset value over new DPLL generic netlink attribute - Convert older drivers to platform remove callback returning void - Add support for PHY package MMD read/write New hardware / drivers: - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed: - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Driver updates: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync" * tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits) lan78xx: remove redundant statement in lan78xx_get_eee lan743x: remove redundant statement in lan743x_ethtool_get_eee bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer() bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel() bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter() tcp: Revert no longer abort SYN_SENT when receiving some ICMP Revert "mlx5 updates 2023-12-20" Revert "net: stmmac: Enable Per DMA Channel interrupt" ipvlan: Remove usage of the deprecated ida_simple_xx() API ipvlan: Fix a typo in a comment net/sched: Remove ipt action tests net: stmmac: Use interrupt mode INTM=1 for per channel irq net: stmmac: Add support for TX/RX channel interrupt net: stmmac: Make MSI interrupt routine generic dt-bindings: net: snps,dwmac: per channel irq net: phy: at803x: make read_status more generic net: phy: at803x: add support for cdt cross short test for qca808x net: phy: at803x: refactor qca808x cable test get status function net: phy: at803x: generalize cdt fault length function net: ethernet: cortina: Drop TSO support ... |
||
Linus Torvalds
|
120a201bd2 |
hardening updates for v6.8-rc1
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko) - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva, Kees Cook) - Add KFENCE test to LKDTM (Stephen Boyd) - Various strncpy() refactorings (Justin Stitt) - Fix qnx4 to avoid writing into the smaller of two overlapping buffers - Various strlcpy() refactorings -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0 8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL PoC2UxSfGd+YT079fw== =1Wxx -----END PGP SIGNATURE----- Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko) - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva, Kees Cook) - Add KFENCE test to LKDTM (Stephen Boyd) - Various strncpy() refactorings (Justin Stitt) - Fix qnx4 to avoid writing into the smaller of two overlapping buffers - Various strlcpy() refactorings * tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: qnx4: Use get_directory_fname() in qnx4_match() qnx4: Extract dir entry filename processing into helper atags_proc: Add __counted_by for struct buffer and use struct_size() tracing/uprobe: Replace strlcpy() with strscpy() params: Fix multi-line comment style params: Sort headers params: Use size_add() for kmalloc() params: Do not go over the limit when getting the string length params: Introduce the param_unknown_fn type lkdtm: Add kfence read after free crash type nvme-fc: replace deprecated strncpy with strscpy nvdimm/btt: replace deprecated strncpy with strscpy nvme-fabrics: replace deprecated strncpy with strscpy drm/modes: replace deprecated strncpy with strscpy_pad afs: Add __counted_by for struct afs_acl and use struct_size() VMCI: Annotate struct vmci_handle_arr with __counted_by i40e: Annotate struct i40e_qvlist_info with __counted_by HID: uhid: replace deprecated strncpy with strscpy samples: Replace strlcpy() with strscpy() SUNRPC: Replace strlcpy() with strscpy() |
||
Breno Leitao
|
aefb2f2e61 |
x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org |
||
Steven Rostedt (Google)
|
25742aeb13 |
ring-buffer: Remove stale comment from ring_buffer_size()
It's been 11 years since the ring_buffer_size() function was updated to use the nr_pages from the buffer->buffers[cpu] structure instead of using the buffer->nr_pages that no longer exists. The comment in the code is more of what a change log should have and is pretty much useless for development. It's saying how things worked back in 2012 that bares no purpose on today's code. Remove it. Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/ Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reported-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Linus Torvalds
|
5db8752c3b |
vfs-6.8.iov_iter
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI= =p6PB -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iov_iter cleanups from Christian Brauner: "This contains a minor cleanup. The patches drop an unused argument from import_single_range() allowing to replace import_single_range() with import_ubuf() and dropping import_single_range() completely" * tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iov_iter: replace import_single_range() with import_ubuf() iov_iter: remove unused 'iov' argument from import_single_range() |
||
Steven Rostedt (Google)
|
4f1991a92c |
tracing histograms: Simplify parse_actions() function
The parse_actions() function uses 'len = str_has_prefix()' to test which action is in the string being parsed. But then it goes and repeats the logic for each different action. This logic can be simplified and duplicate code can be removed as 'len' contains the length of the found prefix which should be used for all actions. Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Jakub Kicinski
|
e63c1822ac |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c |
||
Linus Torvalds
|
453f5db061 |
tracing fixes for v6.7-rc7:
- Fix readers that are blocked on the ring buffer when buffer_percent is 100%. They are supposed to wake up when the buffer is full, but because the sub-buffer that the writer is on is never considered "dirty" in the calculation, dirty pages will never equal nr_pages. Add +1 to the dirty count in order to count for the sub-buffer that the writer is on. - When a reader is blocked on the "snapshot_raw" file, it is to be woken up when a snapshot is done and be able to read the snapshot buffer. But because the snapshot swaps the buffers (the main one with the snapshot one), and the snapshot reader is waiting on the old snapshot buffer, it was not woken up (because it is now on the main buffer after the swap). Worse yet, when it reads the buffer after a snapshot, it's not reading the snapshot buffer, it's reading the live active main buffer. Fix this by forcing a wakeup of all readers on the snapshot buffer when a new snapshot happens, and then update the buffer that the reader is reading to be back on the snapshot buffer. - Fix the modification of the direct_function hash. There was a race when new functions were added to the direct_function hash as when it moved function entries from the old hash to the new one, a direct function trace could be hit and not see its entry. This is fixed by allocating the new hash, copy all the old entries onto it as well as the new entries, and then use rcu_assign_pointer() to update the new direct_function hash with it. This also fixes a memory leak in that code. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZAzTxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9IAP9e6wZ74aEjMED9nsbC49EpyCNTqa72 y0uDS/p9ppv52gD7Be+l+kJQzYNh6bZU0+B19hNC2QVn38jb7sOadfO/1Q8= =NDkf -----END PGP SIGNATURE----- Merge tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix readers that are blocked on the ring buffer when buffer_percent is 100%. They are supposed to wake up when the buffer is full, but because the sub-buffer that the writer is on is never considered "dirty" in the calculation, dirty pages will never equal nr_pages. Add +1 to the dirty count in order to count for the sub-buffer that the writer is on. - When a reader is blocked on the "snapshot_raw" file, it is to be woken up when a snapshot is done and be able to read the snapshot buffer. But because the snapshot swaps the buffers (the main one with the snapshot one), and the snapshot reader is waiting on the old snapshot buffer, it was not woken up (because it is now on the main buffer after the swap). Worse yet, when it reads the buffer after a snapshot, it's not reading the snapshot buffer, it's reading the live active main buffer. Fix this by forcing a wakeup of all readers on the snapshot buffer when a new snapshot happens, and then update the buffer that the reader is reading to be back on the snapshot buffer. - Fix the modification of the direct_function hash. There was a race when new functions were added to the direct_function hash as when it moved function entries from the old hash to the new one, a direct function trace could be hit and not see its entry. This is fixed by allocating the new hash, copy all the old entries onto it as well as the new entries, and then use rcu_assign_pointer() to update the new direct_function hash with it. This also fixes a memory leak in that code. - Fix eventfs ownership * tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix modification of direct_function hash while in use tracing: Fix blocked reader of snapshot buffer ring-buffer: Fix wake ups when buffer_percent is set to 100 eventfs: Fix file and directory uid and gid ownership |
||
Steven Rostedt (Google)
|
d05cb47066 |
ftrace: Fix modification of direct_function hash while in use
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}
Where ftrace_add_rec_direct() has:
if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;
if (size < 32)
size = 32;
new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;
*free_hash = direct_functions;
direct_functions = new_hash;
}
The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.
But this also exposed a more serious bug.
The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:
new_hash = dup_hash(direct_functions, size);
and
direct_functions = new_hash;
can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.
That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.
Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.
The following is done:
1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.
2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.
3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.
4. Copy (not move) the direct_function entries over to the new_hash.
5. Copy the entries of the added hash to the new_hash.
6. If everything succeeds, then use rcu_pointer_assign() to update the
direct_functions with the new_hash.
This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes:
|
||
Steven Rostedt (Google)
|
39a7dc23a1 |
tracing: Fix blocked reader of snapshot buffer
If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.
That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.
This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.
But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.
Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.
Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes:
|
||
Steven Rostedt (Google)
|
623b1f896f |
ring-buffer: Fix wake ups when buffer_percent is set to 100
The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.
0 - is to wait until any data is in the buffer
1 - is to wait for 1% of the sub buffers to be filled
50 - would be half of the sub buffers are filled with data
100 - is not to wake the waiter until the ring buffer is completely full
Unfortunately the test for being full was:
dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);
Where "full" is the value for "buffer_percent".
There is two issues with the above when full == 100.
1. dirty * 100 > 100 * nr_pages will never be true
That is, the above is basically saying that if the user sets
buffer_percent to 100, more pages need to be dirty than exist in the
ring buffer!
2. The page that the writer is on is never considered dirty, as dirty
pages are only those that are full. When the writer goes to a new
sub-buffer, it clears the contents of that sub-buffer.
That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.
To fix this, add one to dirty and use ">=" in the compare.
Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes:
|
||
Paolo Abeni
|
56794e5358 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |
||
Linus Torvalds
|
13b734465a |
Tracing fixes for 6.7:
- Fix another kerneldoc warning - Fix eventfs files to inherit the ownership of its parent directory. The dynamic creating of dentries in eventfs did not take into account if the tracefs file system was mounted with a gid/uid, and would still default to the gid/uid of root. This is a regression. - Fix warning when synthetic event testing is enabled along with startup event tracing testing is enabled -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZYRYjhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qs0aAQCXWcBeDEWsi8VxAOBU5Q6isvXn2koM +xSX6LJPh6hFVAD+Pc3oLgvyE5IyqNUM9RYtpwPVMhpAsyE9FIz3TWarEww= =LY0i -----END PGP SIGNATURE----- Merge tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix another kerneldoc warning - Fix eventfs files to inherit the ownership of its parent directory. The dynamic creation of dentries in eventfs did not take into account if the tracefs file system was mounted with a gid/uid, and would still default to the gid/uid of root. This is a regression. - Fix warning when synthetic event testing is enabled along with startup event tracing testing is enabled * tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing / synthetic: Disable events after testing in synth_event_gen_test_init() eventfs: Have event files and directories default to parent uid and gid tracing/synthetic: fix kernel-doc warnings |
||
Steven Rostedt (Google)
|
3cb3091138 |
ring-buffer: Use subbuf_order for buffer page masking
The comparisons to PAGE_SIZE were all converted to use the
buffer->subbuf_order, but the use of PAGE_MASK was missed.
Convert all the PAGE_MASK usages over to:
(PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1
Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes:
|
||
Steven Rostedt (Google)
|
2f84b39f48 |
tracing: Update subbuffer with kilobytes not page order
Using page order for deciding what the size of the ring buffer sub buffers are is exposing a bit too much of the implementation. Although the sub buffers are only allocated in orders of pages, allow the user to specify the minimum size of each sub-buffer via kilobytes like they can with the buffer size itself. If the user specifies 3 via: echo 3 > buffer_subbuf_size_kb Then the sub-buffer size will round up to 4kb (on a 4kb page size system). If they specify: echo 6 > buffer_subbuf_size_kb The sub-buffer size will become 8kb. and so on. Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
8e7b58c27b |
ring-buffer: Just update the subbuffers when changing their allocation order
The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with the new subbuffers with the updated order, and if they
all successfully were created, then they the ring_buffer's per_cpu buffers
would be freed and replaced by them.
The problem is that the freed per_cpu buffers contains state that would be
lost. Running the following commands:
1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker
Would result in:
-bash: echo: write error: Bad file descriptor
That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).
In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffers CPUs were disabled).
In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
"record_disable" count.
When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.
Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does and just allocate new subbuf
pages into the new_pages link list of the per_cpu buffer and if they all
succeed, then replace the old sub buffers with the new ones. This keeps
the per_cpu buffer descriptor in tact and by doing so, keeps its state.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes:
|
||
Steven Rostedt (Google)
|
353cc21937 |
ring-buffer: Keep the same size when updating the order
The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but this also changes the size of the buffer in doing so.
As the size is determined by nr_pages * subbuf_size. If the subbuf_size is
increased without decreasing the nr_pages, this causes the total size of
the buffer to increase.
This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.
Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes:
|
||
Steven Rostedt (Google)
|
fa4b54af5b |
tracing: Stop the tracing while changing the ring buffer subbuf size
Because the main buffer and the snapshot buffer need to be the same for
some tracers, otherwise it will fail and disable all tracing, the tracers
need to be stopped while updating the sub buffer sizes so that the tracers
see the main and snapshot buffers with the same sub buffer size.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes:
|
||
Steven Rostedt (Google)
|
aa067682ad |
tracing: Update snapshot order along with main buffer order
When updating the order of the sub buffers for the main buffer, make sure that if the snapshot buffer exists, that it gets its order updated as well. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
4e958db34f |
ring-buffer: Make sure the spare sub buffer used for reads has same size
Now that the ring buffer specifies the size of its sub buffers, they all need to be the same size. When doing a read, a swap is done with a spare page. Make sure they are the same size before doing the swap, otherwise the read will fail. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b81e03a249 |
ring-buffer: Do no swap cpu buffers if order is different
As all the subbuffer order (subbuffer sizes) must be the same throughout the ring buffer, check the order of the buffers that are doing a CPU buffer swap in ring_buffer_swap_cpu() to make sure they are the same. If the are not the same, then fail to do the swap, otherwise the ring buffer will think the CPU buffer has a specific subbuffer size when it does not. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
22887dfba0 |
ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously were not.
Make sure they are freed too.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes:
|
||
Steven Rostedt (Google)
|
88b30c7f5d |
tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.
The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.
The synth_event_gen_test_exit() disables the events it created and
destroys the events.
If the module is builtin, the events are never disabled. The issue is, the
events should be disable after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.
When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:
Running tests on trace events:
Testing event create_synth_test:
Enabled event during self test!
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:event_trace_self_tests+0x1c2/0x480
Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246
RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64
RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a
R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090
R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078
FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0
Call Trace:
<TASK>
? __warn+0xa5/0x200
? event_trace_self_tests+0x1c2/0x480
? report_bug+0x1f6/0x220
? handle_bug+0x6f/0x90
? exc_invalid_op+0x17/0x50
? asm_exc_invalid_op+0x1a/0x20
? tracer_preempt_on+0x78/0x1c0
? event_trace_self_tests+0x1c2/0x480
? __pfx_event_trace_self_tests_init+0x10/0x10
event_trace_self_tests_init+0x27/0xe0
do_one_initcall+0xd6/0x3c0
? __pfx_do_one_initcall+0x10/0x10
? kasan_set_track+0x25/0x30
? rcu_is_watching+0x38/0x60
kernel_init_freeable+0x324/0x450
? __pfx_kernel_init+0x10/0x10
kernel_init+0x1f/0x1e0
? _raw_spin_unlock_irq+0x33/0x50
ret_from_fork+0x34/0x60
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.
Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes:
|
||
Randy Dunlap
|
7beb82b7d5 |
tracing/synthetic: fix kernel-doc warnings
scripts/kernel-doc warns about using @args: for variadic arguments to functions. Documentation/doc-guide/kernel-doc.rst says that this should be written as @...: instead, so update the source code to match that, preventing the warnings. trace_events_synth.c:1165: warning: Excess function parameter 'args' description in '__synth_event_gen_cmd_start' trace_events_synth.c:1714: warning: Excess function parameter 'args' description in 'synth_event_trace' Link: https://lore.kernel.org/linux-trace-kernel/20231220061226.30962-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: |
||
Tzvetomir Stoyanov (VMware)
|
bce761d757 |
ring-buffer: Read and write to ring buffers with custom sub buffer size
As the size of the ring sub buffer page can be changed dynamically, the logic that reads and writes to the buffer should be fixed to take that into account. Some internal ring buffer APIs are changed: ring_buffer_alloc_read_page() ring_buffer_free_read_page() ring_buffer_read_page() A new API is introduced: ring_buffer_read_page_data() Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.875145995@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> [ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Tzvetomir Stoyanov (VMware)
|
f9b94daa54 |
ring-buffer: Set new size of the ring buffer sub page
There are two approaches when changing the size of the ring buffer sub page: 1. Destroying all pages and allocating new pages with the new size. 2. Allocating new pages, copying the content of the old pages before destroying them. The first approach is easier, it is selected in the proposed implementation. Changing the ring buffer sub page size is supposed to not happen frequently. Usually, that size should be set only once, when the buffer is not in use yet and is supposed to be empty. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.588995543@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Tzvetomir Stoyanov (VMware)
|
2808e31ec1 |
ring-buffer: Add interface for configuring trace sub buffer size
The trace ring buffer sub page size can be configured, per trace instance. A new ftrace file "buffer_subbuf_order" is added to get and set the size of the ring buffer sub page for current trace instance. The size must be an order of system page size, that's why the new interface works with system page order, instead of absolute page size: 0 means the ring buffer sub page is equal to 1 system page and so forth: 0 - 1 system page 1 - 2 system pages 2 - 4 system pages ... The ring buffer sub page size is limited between 1 and 128 system pages. The default value is 1 system page. New ring buffer APIs are introduced: ring_buffer_subbuf_order_set() ring_buffer_subbuf_order_get() ring_buffer_subbuf_size_get() Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.298324722@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Tzvetomir Stoyanov (VMware)
|
139f840021 |
ring-buffer: Page size per ring buffer
Currently the size of one sub buffer page is global for all buffers and it is hard coded to one system page. In order to introduce configurable ring buffer sub page size, the internal logic should be refactored to work with sub page size per ring buffer. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.009147038@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Tzvetomir Stoyanov (VMware)
|
d5cfbdfc96 |
ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
In order to introduce sub-buffer size per ring buffer, some internal refactoring is needed. As ring_buffer_print_page_header() will depend on the trace_buffer structure, it is moved after the structure definition. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185627.723857541@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Linus Torvalds
|
55cb5f4368 |
tracing fix for 6.7-rc6
While working on the ring buffer, I found one more bug with the timestamp code, and the fix for this removed the need for the final 64-bit cmpxchg! The ring buffer events hold a "delta" from the previous event. If it is determined that the delta can not be calculated, it falls back to adding an absolute timestamp value. The way to know if the delta can be used is via two stored timestamps in the per-cpu buffer meta data: before_stamp and write_stamp The before_stamp is written by every event before it tries to allocate its space on the ring buffer. The write_stamp is written after it allocates its space and knows that nothing came in after it read the previous before_stamp and write_stamp and the two matched. A previous fix |
||
Andrii Nakryiko
|
d17aff807f |
Revert BPF token-related functionality
This patch includes the following revert (one conflicting BPF FS patch and three token patch sets, represented by merge commits): - revert |
||
Steven Rostedt (Google)
|
f50345b49b |
ring-buffer: Check if absolute timestamp goes backwards
The check_buffer() which checks the timestamps of the ring buffer sub-buffer page, when enabled, only checks if the adding of deltas of the events from the last absolute timestamp or the timestamp of the sub-buffer page adds up to the current event. What it does not check is if the absolute timestamp causes the time of the events to go backwards, as that can cause issues elsewhere. Test for the timestamp going backwards too. This also fixes a slight issue where if the warning triggers at boot up (because of the resetting of the tsc), it will disable all further checks, even those that are after boot Have it continue checking if the warning was ignored during boot up. Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
d40dbb617a |
ring-buffer: Add interrupt information to dump of data sub-buffer
When the ring buffer timestamp verifier triggers, it dumps the content of the sub-buffer. But currently it only dumps the timestamps and the offset of the data as well as the deltas. It would be even more informative if the event data also showed the interrupt context level it was in. That is, if each event showed that the event was written in normal, softirq, irq or NMI context. Then a better idea about how the events may have been interrupted from each other. As the payload of the ring buffer is really a black box of the ring buffer, just assume that if the payload is larger than a trace entry, that it is a trace entry. As trace entries have the interrupt context information saved in a flags field, look at that location and report the output of the flags. If the payload is not a trace entry, there's no way to really know, and the information will be garbage. But that's OK, because this is for debugging only (this output is not used in production as the buffer check that calls it causes a huge overhead to the tracing). This information, when available, is crucial for debugging timestamp issues. If it's garbage, it will also be pretty obvious that its garbage too. As this output usually happens in kselftests of the tracing code, the user will know what the payload is at the time. Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
c84897c0ff |
ring-buffer: Remove 32bit timestamp logic
Each event has a 27 bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59 bit absolute timestamp. Until |
||
Steven Rostedt (Google)
|
76ca20c748 |
tracing: Increase size of trace_marker_raw to max ring buffer entry
There's no reason to give an arbitrary limit to the size of a raw trace marker. Just let it be as big as the size that is allowed by the ring buffer itself. And there's also no reason to artificially break up the write to TRACE_BUF_SIZE, as that's not even used. Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
9482341d9b |
tracing: Have trace_marker break up by lines by size of trace_seq
If a trace_marker write is bigger than what trace_seq can hold, then it will print "LINE TOO BIG" message and not what was written. Instead, check if the write is bigger than the trace_seq and break it up by that size. Ideally, we could make the trace_seq dynamic that could hold this. But that's for another time. Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
40fc60e36c |
trace_seq: Increase the buffer size to almost two pages
Now that trace_marker can hold more than 1KB string, and can write as much as the ring buffer can hold, the trace_seq is not big enough to hold writes: ~# a="1234567890" ~# cnt=4080 ~# s="" ~# while [ $cnt -gt 10 ]; do ~# s="${s}${a}" ~# cnt=$((cnt-10)) ~# done ~# echo $s > trace_marker ~# cat trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <...>-860 [002] ..... 105.543465: tracing_mark_write[LINE TOO BIG] <...>-860 [002] ..... 105.543496: tracing_mark_write: 789012345678901234567890 By increasing the trace_seq buffer to almost two pages, it can now print out the first line. This also subtracts the rest of the trace_seq fields from the buffer, so that the entire trace_seq is now PAGE_SIZE aligned. Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
8ec90be7f1 |
tracing: Allow for max buffer data size trace_marker writes
Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
0b9036efd8 |
ring-buffer: Add offset of events in dump on mismatch
On bugs that have the ring buffer timestamp get out of sync, the config CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS, that checks for it and if it is detected it causes a dump of the bad sub buffer. It shows each event and their timestamp as well as the delta in the event. But it's also good to see the offset into the subbuffer for that event to know if how close to the end it is. Also print where the last event actually ended compared to where it was expected to end. Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
d23569979c |
tracing: Allow creating instances with specified system events
A trace instance may only need to enable specific events. As the eventfs directory of an instance currently creates all events which adds overhead, allow internal instances to be created with just the events in systems that they care about. This currently only deals with systems and not individual events, but this should bring down the overhead of creating instances for specific use cases quite bit. The trace_array_get_by_name() now has another parameter "systems". This parameter is a const string pointer of a comma/space separated list of event systems that should be created by the trace_array. (Note if the trace_array already exists, this parameter is ignored). The list of systems is saved and if a module is loaded, its events will not be added unless the system for those events also match the systems string. Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: Arun Easi <aeasi@marvell.com> Cc: Daniel Wagner <dwagner@suse.de> Tested-by: Dmytro Maluka <dmaluka@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b803d7c664 |
ring-buffer: Fix slowpath of interrupted event
To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.
1. before_stamp
2. write_stamp
When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.
This is done by the following:
/*A*/ w = current position on the ring buffer
before = before_stamp
after = write_stamp
ts = read current timestamp
if (before != after) {
write_stamp is not valid, force adding an absolute
timestamp.
}
/*B*/ before_stamp = ts
/*C*/ write = local_add_return(event length, position on ring buffer)
if (w == write - event length) {
/* Nothing interrupted between A and C */
/*E*/ write_stamp = ts;
delta = ts - after
/*
* If nothing interrupted again,
* before_stamp == write_stamp and write_stamp
* can be used to calculate the delta for
* events that come in after this one.
*/
} else {
/*
* The slow path!
* Was interrupted between A and C.
*/
This is the place that there's a bug. We currently have:
after = write_stamp
ts = read current timestamp
/*F*/ if (write == current position on the ring buffer &&
after < ts && cmpxchg(write_stamp, after, ts)) {
delta = ts - after;
} else {
delta = 0;
}
The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.
But this may not be the case:
If a task context event was interrupted by softirq between B and C.
And the softirq wrote an event that got interrupted by a hard irq between
C and E.
and the hard irq wrote an event (does not need to be interrupted)
We have:
/*B*/ before_stamp = ts of normal context
---> interrupted by softirq
/*B*/ before_stamp = ts of softirq context
---> interrupted by hardirq
/*B*/ before_stamp = ts of hard irq context
/*E*/ write_stamp = ts of hard irq context
/* matches and write_stamp valid */
<----
/*E*/ write_stamp = ts of softirq context
/* No longer matches before_stamp, write_stamp is not valid! */
<---
w != write - length, go to slow path
// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//
after = write_stamp (this is the ts of softirq)
ts = read current timestamp
if (write == current position on the ring buffer [true] &&
after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
delta = ts - after [Wrong!]
The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two. This
will shift all the remaining event timestamps on the sub-buffer
incorrectly.
The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.
Instead, the following logic can be done to fix this:
before = before_stamp
ts = read current timestamp
before_stamp = ts
after = write_stamp
if (write == current position on the ring buffer &&
after == before && after < ts) {
delta = ts - after
} else {
delta = 0;
}
The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.
As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
This means the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.
Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes:
|
||
Jakub Kicinski
|
c49b292d03 |
netdev
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7 uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1 TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861 xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw= =stU2 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Jiri Olsa
|
3983c00281 |
bpf: Fail uprobe multi link with negative offset
Currently the __uprobe_register will return 0 (success) when called with negative offset. The reason is that the call to register_for_each_vma and then build_map_info won't return error for negative offset. They just won't do anything - no matching vma is found so there's no registered breakpoint for the uprobe. I don't think we can change the behaviour of __uprobe_register and fail for negative uprobe offset, because apps might depend on that already. But I think we can still make the change and check for it on bpf multi link syscall level. Also moving the __get_user call and check for the offsets to the top of loop, to fail early without extra __get_user calls for ref_ctr_offset and cookie arrays. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org |
||
Naveen N Rao
|
9c556b7c3f |
trace/kprobe: Display the actual notrace function when rejecting a probe
Trying to probe update_sd_lb_stats() using perf results in the below message in the kernel log: trace_kprobe: Could not probe notrace function _text This is because 'perf probe' specifies the kprobe location as an offset from '_text': $ sudo perf probe -D update_sd_lb_stats p:probe/update_sd_lb_stats _text+1830728 However, the error message is misleading and doesn't help convey the actual notrace function that is being probed. Fix this by looking up the actual function name that is being probed. With this fix, we now get the below message in the kernel log: trace_kprobe: Could not probe notrace function update_sd_lb_stats.constprop.0 Link: https://lore.kernel.org/all/20231214051702.1687300-1-naveen@kernel.org/ Signed-off-by: Naveen N Rao <naveen@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Linus Torvalds
|
3b8a9b2e68 |
Tracing fixes for v6.7-rc5:
- Fix eventfs to check creating new files for events with names greater than NAME_MAX. The eventfs lookup needs to check the return result of simple_lookup(). - Fix the ring buffer to check the proper max data size. Events must be able to fit on the ring buffer sub-buffer, if it cannot, then it fails to be written and the logic to add the event is avoided. The code to check if an event can fit failed to add the possible absolute timestamp which may make the event not be able to fit. This causes the ring buffer to go into an infinite loop trying to find a sub-buffer that would fit the event. Luckily, there's a check that will bail out if it looped over a 1000 times and it also warns. The real fix is not to add the absolute timestamp to an event that is starting at the beginning of a sub-buffer because it uses the sub-buffer timestamp. By avoiding the timestamp at the start of the sub-buffer allows events that pass the first check to always find a sub-buffer that it can fit on. - Have large events that do not fit on a trace_seq to print "LINE TOO BIG" like it does for the trace_pipe instead of what it does now which is to silently drop the output. - Fix a memory leak of forgetting to free the spare page that is saved by a trace instance. - Update the size of the snapshot buffer when the main buffer is updated if the snapshot buffer is allocated. - Fix ring buffer timestamp logic by removing all the places that tried to put the before_stamp back to the write stamp so that the next event doesn't add an absolute timestamp. But each of these updates added a race where by making the two timestamp equal, it was validating the write_stamp so that it can be incorrectly used for calculating the delta of an event. - There's a temp buffer used for printing the event that was using the event data size for allocation when it needed to use the size of the entire event (meta-data and payload data) - For hardening, use "%.*s" for printing the trace_marker output, to limit the amount that is printed by the size of the event. This was discovered by development that added a bug that truncated the '\0' and caused a crash. - Fix a use-after-free bug in the use of the histogram files when an instance is being removed. - Remove a useless update in the rb_try_to_discard of the write_stamp. The before_stamp was already changed to force the next event to add an absolute timestamp that the write_stamp is not used. But the write_stamp is modified again using an unneeded 64-bit cmpxchg. - Fix several races in the 32-bit implementation of the rb_time_cmpxchg() that does a 64-bit cmpxchg. - While looking at fixing the 64-bit cmpxchg, I noticed that because the ring buffer uses normal cmpxchg, and this can be done in NMI context, there's some architectures that do not have a working cmpxchg in NMI context. For these architectures, fail recording events that happen in NMI context. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZX0nChQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qlOMAQD3iegTcceQl9lAsroa3tb3xdweC1GP 51MsX5athxSyoQEAutI/2pBCtLFXgTLMHAMd5F23EM1U9rha7W0myrnvKQY= =d3bS -----END PGP SIGNATURE----- Merge tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix eventfs to check creating new files for events with names greater than NAME_MAX. The eventfs lookup needs to check the return result of simple_lookup(). - Fix the ring buffer to check the proper max data size. Events must be able to fit on the ring buffer sub-buffer, if it cannot, then it fails to be written and the logic to add the event is avoided. The code to check if an event can fit failed to add the possible absolute timestamp which may make the event not be able to fit. This causes the ring buffer to go into an infinite loop trying to find a sub-buffer that would fit the event. Luckily, there's a check that will bail out if it looped over a 1000 times and it also warns. The real fix is not to add the absolute timestamp to an event that is starting at the beginning of a sub-buffer because it uses the sub-buffer timestamp. By avoiding the timestamp at the start of the sub-buffer allows events that pass the first check to always find a sub-buffer that it can fit on. - Have large events that do not fit on a trace_seq to print "LINE TOO BIG" like it does for the trace_pipe instead of what it does now which is to silently drop the output. - Fix a memory leak of forgetting to free the spare page that is saved by a trace instance. - Update the size of the snapshot buffer when the main buffer is updated if the snapshot buffer is allocated. - Fix ring buffer timestamp logic by removing all the places that tried to put the before_stamp back to the write stamp so that the next event doesn't add an absolute timestamp. But each of these updates added a race where by making the two timestamp equal, it was validating the write_stamp so that it can be incorrectly used for calculating the delta of an event. - There's a temp buffer used for printing the event that was using the event data size for allocation when it needed to use the size of the entire event (meta-data and payload data) - For hardening, use "%.*s" for printing the trace_marker output, to limit the amount that is printed by the size of the event. This was discovered by development that added a bug that truncated the '\0' and caused a crash. - Fix a use-after-free bug in the use of the histogram files when an instance is being removed. - Remove a useless update in the rb_try_to_discard of the write_stamp. The before_stamp was already changed to force the next event to add an absolute timestamp that the write_stamp is not used. But the write_stamp is modified again using an unneeded 64-bit cmpxchg. - Fix several races in the 32-bit implementation of the rb_time_cmpxchg() that does a 64-bit cmpxchg. - While looking at fixing the 64-bit cmpxchg, I noticed that because the ring buffer uses normal cmpxchg, and this can be done in NMI context, there's some architectures that do not have a working cmpxchg in NMI context. For these architectures, fail recording events that happen in NMI context. * tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI ring-buffer: Have rb_time_cmpxchg() set the msb counter too ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() ring-buffer: Do not try to put back write_stamp tracing: Fix uaf issue when open the hist or hist_debug file tracing: Add size check when printing trace_marker output ring-buffer: Have saved event hold the entire event ring-buffer: Do not update before stamp when switching sub-buffers tracing: Update snapshot buffer on resize if it is allocated ring-buffer: Fix memory leak of free page eventfs: Fix events beyond NAME_MAX blocking tasks tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing ring-buffer: Fix writing to the buffer with max_data_size |
||
Hou Tao
|
d6d1e6c17c |
bpf: Limit the number of kprobes when attaching program to multiple kprobes
An abnormally big cnt may also be assigned to kprobe_multi.cnt when
attaching multiple kprobes. It will trigger the following warning in
kvmalloc_node():
if (unlikely(size > INT_MAX)) {
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
return NULL;
}
Fix the warning by limiting the maximal number of kprobes in
bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than
MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG.
Fixes:
|
||
Hou Tao
|
8b2efe51ba |
bpf: Limit the number of uprobes when attaching program to multiple uprobes
An abnormally big cnt may be passed to link_create.uprobe_multi.cnt,
and it will trigger the following warning in kvmalloc_node():
if (unlikely(size > INT_MAX)) {
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
return NULL;
}
Fix the warning by limiting the maximal number of uprobes in
bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than
MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG.
Fixes:
|
||
Steven Rostedt (Google)
|
712292308a |
ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
As the ring buffer recording requires cmpxchg() to work, if the architecture does not support cmpxchg in NMI, then do not do any recording within an NMI. Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
0aa0e5289c |
ring-buffer: Have rb_time_cmpxchg() set the msb counter too
The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom
The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.
Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Mathieu Desnoyers
|
dec890089b |
ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time
stamp:
rb_time_cmpxchg()
[...]
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
do {
c = local_read(&t->cnt);
top = local_read(&t->top);
bottom = local_read(&t->bottom);
msb = local_read(&t->msb);
} while (c != local_read(&t->cnt));
*cnt = rb_time_cnt(top);
/* If top and msb counts don't match, this interrupted a write */
if (*cnt != rb_time_cnt(msb))
return false;
^ this check fails to catch that "bottom" is still not updated.
So the old "bottom" value is returned, which is wrong.
Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.
The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().
Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com
Fixes:
|
||
Steven Rostedt (Google)
|
fff88fa0fb |
ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:
static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
if (val != expect)
return false;
<<<< interrupted here!
cnt = local_read(&t->cnt);
The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!
The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.
Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes:
|
||
Steven Rostedt (Google)
|
083e9f65bd |
ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.
But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.
The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).
But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.
Remove this useless update, and in doing so, we remove another
cmpxchg64()!
Also update the comments to reflect this change as well as remove some
extra white space in another comment.
Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes:
|
||
Steven Rostedt (Google)
|
dd93942570 |
ring-buffer: Do not try to put back write_stamp
If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code try to reset the write stamp back to the what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.
To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted the event could have been
interrupted in a way that it was interrupted as well, and can end up
leaving with an invalid write_stamp. But if this happens and returns to
this context that uses the before_stamp to update the write_stamp again,
it can possibly incorrectly make it valid, causing later events to have in
correct time stamps.
As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add a absolute
timestamp and validate everything again.
Bonus points: This gets rid of another cmpxchg64!
Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes:
|
||
Jakub Kicinski
|
8f674972d6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c |
||
Zheng Yejian
|
1cc111b9cd |
tracing: Fix uaf issue when open the hist or hist_debug file
KASAN report following issue. The root cause is when opening 'hist' file of an instance and accessing 'trace_event_file' in hist_show(), but 'trace_event_file' has been freed due to the instance being removed. 'hist_debug' file has the same problem. To fix it, call tracing_{open,release}_file_tr() in file_operations callback to have the ref count and avoid 'trace_event_file' being freed. BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278 Read of size 8 at addr ffff242541e336b8 by task head/190 CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x98/0xf8 show_stack+0x1c/0x30 dump_stack_lvl+0x44/0x58 print_report+0xf0/0x5a0 kasan_report+0x80/0xc0 __asan_report_load8_noabort+0x1c/0x28 hist_show+0x11e0/0x1278 seq_read_iter+0x344/0xd78 seq_read+0x128/0x1c0 vfs_read+0x198/0x6c8 ksys_read+0xf4/0x1e0 __arm64_sys_read+0x70/0xa8 invoke_syscall+0x70/0x260 el0_svc_common.constprop.0+0xb0/0x280 do_el0_svc+0x44/0x60 el0_svc+0x34/0x68 el0t_64_sync_handler+0xb8/0xc0 el0t_64_sync+0x168/0x170 Allocated by task 188: kasan_save_stack+0x28/0x50 kasan_set_track+0x28/0x38 kasan_save_alloc_info+0x20/0x30 __kasan_slab_alloc+0x6c/0x80 kmem_cache_alloc+0x15c/0x4a8 trace_create_new_event+0x84/0x348 __trace_add_new_event+0x18/0x88 event_trace_add_tracer+0xc4/0x1a0 trace_array_create_dir+0x6c/0x100 trace_array_create+0x2e8/0x568 instance_mkdir+0x48/0x80 tracefs_syscall_mkdir+0x90/0xe8 vfs_mkdir+0x3c4/0x610 do_mkdirat+0x144/0x200 __arm64_sys_mkdirat+0x8c/0xc0 invoke_syscall+0x70/0x260 el0_svc_common.constprop.0+0xb0/0x280 do_el0_svc+0x44/0x60 el0_svc+0x34/0x68 el0t_64_sync_handler+0xb8/0xc0 el0t_64_sync+0x168/0x170 Freed by task 191: kasan_save_stack+0x28/0x50 kasan_set_track+0x28/0x38 kasan_save_free_info+0x34/0x58 __kasan_slab_free+0xe4/0x158 kmem_cache_free+0x19c/0x508 event_file_put+0xa0/0x120 remove_event_file_dir+0x180/0x320 event_trace_del_tracer+0xb0/0x180 __remove_instance+0x224/0x508 instance_rmdir+0x44/0x78 tracefs_syscall_rmdir+0xbc/0x140 vfs_rmdir+0x1cc/0x4c8 do_rmdir+0x220/0x2b8 __arm64_sys_unlinkat+0xc0/0x100 invoke_syscall+0x70/0x260 el0_svc_common.constprop.0+0xb0/0x280 do_el0_svc+0x44/0x60 el0_svc+0x34/0x68 el0t_64_sync_handler+0xb8/0xc0 el0t_64_sync+0x168/0x170 Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
60be76eeab |
tracing: Add size check when printing trace_marker output
If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*.s", max, field->buf); Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b049525855 |
ring-buffer: Have saved event hold the entire event
For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.
But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.
Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Steven Rostedt (Google)
|
9e45e39dc2 |
ring-buffer: Do not update before stamp when switching sub-buffers
The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp" but only in the comments. These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.
When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that happens,
updating the before stamp is useless, and could even introduce new race
conditions.
The first event on a sub-buffer simply uses the sub-buffer's timestamp
and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
used in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.
As a bonus, it removes a cmpxchg() when crossing sub-buffers!
Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Steven Rostedt (Google)
|
d06aff1cb1 |
tracing: Update snapshot buffer on resize if it is allocated
The snapshot buffer is to mimic the main buffer so that when a snapshot is
needed, the snapshot and main buffer are swapped. When the snapshot buffer
is allocated, it is set to the minimal size that the ring buffer may be at
and still functional. When it is allocated it becomes the same size as the
main ring buffer, and when the main ring buffer changes in size, it should
do.
Currently, the resize only updates the snapshot buffer if it's used by the
current tracer (ie. the preemptirqsoff tracer). But it needs to be updated
anytime it is allocated.
When changing the size of the main buffer, instead of looking to see if
the current tracer is utilizing the snapshot buffer, just check if it is
allocated to know if it should be updated or not.
Also fix typo in comment just above the code change.
Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Steven Rostedt (Google)
|
17d8017581 |
ring-buffer: Fix memory leak of free page
Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with a empty sub-buffer. This allows the reader to have full access to the
content of the sub-buffer that was swapped out without having to worry
about contention with the writer.
The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. Then next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.
Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.
Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Steven Rostedt (Google)
|
b55b0a0d7c |
tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
If a large event was added to the ring buffer that is larger than what the trace_seq can handle, it just drops the output: ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <...>-859 [001] ..... 141.118951: tracing_mark_write <...>-859 [001] ..... 141.148201: tracing_mark_write: 78901234 Instead, catch this case and add some context: ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <...>-852 [001] ..... 121.550551: tracing_mark_write[LINE TOO BIG] <...>-852 [001] ..... 121.550581: tracing_mark_write: 78901234 This now emulates the same output as trace_pipe. Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
b3ae7b67b8 |
ring-buffer: Fix writing to the buffer with max_data_size
The maximum ring buffer data size is the maximum size of data that can be recorded on the ring buffer. Events must be smaller than the sub buffer data size minus any meta data. This size is checked before trying to allocate from the ring buffer because the allocation assumes that the size will fit on the sub buffer. The maximum size was calculated as the size of a sub buffer page (which is currently PAGE_SIZE minus the sub buffer header) minus the size of the meta data of an individual event. But it missed the possible adding of a time stamp for events that are added long enough apart that the event meta data can't hold the time delta. When an event is added that is greater than the current BUF_MAX_DATA_SIZE minus the size of a time stamp, but still less than or equal to BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking for a page that can hold the event. Luckily, there's a check for this loop and after 1000 iterations and a warning is emitted and the ring buffer is disabled. But this should never happen. This can happen when a large event is added first, or after a long period where an absolute timestamp is prefixed to the event, increasing its size by 8 bytes. This passes the check and then goes into the algorithm that causes the infinite loop. For events that are the first event on the sub-buffer, it does not need to add a timestamp, because the sub-buffer itself contains an absolute timestamp, and adding one is redundant. The fix is to check if the event is to be the first event on the sub-buffer, and if it is, then do not add a timestamp. This also fixes 32 bit adding a timestamp when a read of before_stamp or write_stamp is interrupted. There's still no need to add that timestamp if the event is going to be the first event on the sub buffer. Also, if the buffer has "time_stamp_abs" set, then also check if the length plus the timestamp is greater than the BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: |
||
Linus Torvalds
|
17894c2a7a |
tracing fixes for v6.7-rc4:
- Snapshot buffer issues 1. When instances started allowing latency tracers, it uses a snapshot buffer (another buffer that is not written to but swapped with the main buffer that is). The snapshot buffer needs to be the same size as the main buffer. But when the snapshot buffers were added to instances, the code to make the snapshot equal to the main buffer still was only doing it for the main buffer and not the instances. 2. Need to stop the current tracer when resizing the buffers. Otherwise there can be a race if the tracer decides to make a snapshot between resizing the main buffer and the snapshot buffer. 3. When a tracer is "stopped" in disables both the main buffer and the snapshot buffer. This needs to be done for instances and not only the main buffer, now that instances also have a snapshot buffer. - Buffered event for filtering issues When filtering is enabled, because events can be dropped often, it is quicker to copy the event into a temp buffer and write that into the main buffer if it is not filtered or just drop the event if it is, than to write the event into the ring buffer and then try to discard it. This temp buffer is allocated and needs special synchronization to do so. But there were some issues with that: 1. When disabling the filter and freeing the buffer, a call to all CPUs is required to stop each per_cpu usage. But the code called smp_call_function_many() which does not include the current CPU. If the task is migrated to another CPU when it enables the CPUs via smp_call_function_many(), it will not enable the one it is currently on and this causes issues later on. Use on_each_cpu_mask() instead, which includes the current CPU. 2. When the allocation of the buffered event fails, it can give a warning. But the buffered event is just an optimization (it's still OK to write to the ring buffer and free it). Do not WARN in this case. 3. The freeing of the buffer event requires synchronization. First a counter is decremented to zero so that no new uses of it will happen. Then it sets the buffered event to NULL, and finally it frees the buffered event. There's a synchronize_rcu() between the counter decrement and the setting the variable to NULL, but only a smp_wmb() between that and the freeing of the buffer. It is theoretically possible that a user missed seeing the decrement, but will use the buffer after it is free. Another synchronize_rcu() is needed in place of that smp_wmb(). - ring buffer timestamps on 32 bit machines The ring buffer timestamp on 32 bit machines has to break the 64 bit number into multiple values as cmpxchg is required on it, and a 64 bit cmpxchg on 32 bit architectures is very slow. The code use to just use two 32 bit values and make it a 60 bit timestamp where the other 4 bits were used as counters for synchronization. It later came known that the timestamp on 32 bit still need all 64 bits in some cases. So 3 words were created to handle the 64 bits. But issues arised with this: 1. The synchronization logic still only compared the counter with the first two, but not with the third number, so the synchronization could fail unknowingly. 2. A check on discard of an event could race if an event happened between the discard and updating one of the counters. The counter needs to be updated (forcing an absolute timestamp and not to use a delta) before the actual discard happens. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZXIP5hQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qmJxAQDXBZwBUFQjWqZHLJn0S9aaz5FggkeR RmlsOMND0PXcjwD+N6U905i553ehu3SSyOP+5svoi0hyCB2qhj3ZF0LzZQU= =us1V -----END PGP SIGNATURE----- Merge tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Snapshot buffer issues: 1. When instances started allowing latency tracers, it uses a snapshot buffer (another buffer that is not written to but swapped with the main buffer that is). The snapshot buffer needs to be the same size as the main buffer. But when the snapshot buffers were added to instances, the code to make the snapshot equal to the main buffer still was only doing it for the main buffer and not the instances. 2. Need to stop the current tracer when resizing the buffers. Otherwise there can be a race if the tracer decides to make a snapshot between resizing the main buffer and the snapshot buffer. 3. When a tracer is "stopped" in disables both the main buffer and the snapshot buffer. This needs to be done for instances and not only the main buffer, now that instances also have a snapshot buffer. - Buffered event for filtering issues: When filtering is enabled, because events can be dropped often, it is quicker to copy the event into a temp buffer and write that into the main buffer if it is not filtered or just drop the event if it is, than to write the event into the ring buffer and then try to discard it. This temp buffer is allocated and needs special synchronization to do so. But there were some issues with that: 1. When disabling the filter and freeing the buffer, a call to all CPUs is required to stop each per_cpu usage. But the code called smp_call_function_many() which does not include the current CPU. If the task is migrated to another CPU when it enables the CPUs via smp_call_function_many(), it will not enable the one it is currently on and this causes issues later on. Use on_each_cpu_mask() instead, which includes the current CPU. 2.When the allocation of the buffered event fails, it can give a warning. But the buffered event is just an optimization (it's still OK to write to the ring buffer and free it). Do not WARN in this case. 3.The freeing of the buffer event requires synchronization. First a counter is decremented to zero so that no new uses of it will happen. Then it sets the buffered event to NULL, and finally it frees the buffered event. There's a synchronize_rcu() between the counter decrement and the setting the variable to NULL, but only a smp_wmb() between that and the freeing of the buffer. It is theoretically possible that a user missed seeing the decrement, but will use the buffer after it is free. Another synchronize_rcu() is needed in place of that smp_wmb(). - ring buffer timestamps on 32 bit machines The ring buffer timestamp on 32 bit machines has to break the 64 bit number into multiple values as cmpxchg is required on it, and a 64 bit cmpxchg on 32 bit architectures is very slow. The code use to just use two 32 bit values and make it a 60 bit timestamp where the other 4 bits were used as counters for synchronization. It later came known that the timestamp on 32 bit still need all 64 bits in some cases. So 3 words were created to handle the 64 bits. But issues arised with this: 1. The synchronization logic still only compared the counter with the first two, but not with the third number, so the synchronization could fail unknowingly. 2. A check on discard of an event could race if an event happened between the discard and updating one of the counters. The counter needs to be updated (forcing an absolute timestamp and not to use a delta) before the actual discard happens. * tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Test last update in 32bit version of __rb_time_read() ring-buffer: Force absolute timestamp on discard of event tracing: Fix a possible race when disabling buffered events tracing: Fix a warning when allocating buffered events fails tracing: Fix incomplete locking when disabling buffered events tracing: Disable snapshot buffer when stopping instance tracers tracing: Stop current tracer when resizing buffer tracing: Always update snapshot buffer size |
||
Jakub Kicinski
|
2483e7f04c |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/stmicro/stmmac/dwmac5.c drivers/net/ethernet/stmicro/stmmac/dwmac5.h drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c drivers/net/ethernet/stmicro/stmmac/hwif.h |
||
Steven Rostedt (Google)
|
f458a14534 |
ring-buffer: Test last update in 32bit version of __rb_time_read()
Since 64 bit cmpxchg() is very expensive on 32bit architectures, the
timestamp used by the ring buffer does some interesting tricks to be able
to still have an atomic 64 bit number. It originally just used 60 bits and
broke it up into two 32 bit words where the extra 2 bits were used for
synchronization. But this was not enough for all use cases, and all 64
bits were required.
The 32bit version of the ring buffer timestamp was then broken up into 3
32bit words using the same counter trick. But one update was not done. The
check to see if the read operation was done without interruption only
checked the first two words and not last one (like it had before this
update). Fix it by making sure all three updates happen without
interruption by comparing the initial counter with the last updated
counter.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Steven Rostedt (Google)
|
b2dd797543 |
ring-buffer: Force absolute timestamp on discard of event
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.
If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
Andrii Nakryiko
|
4cbb270e11 |
bpf: take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and perfmon_capable() calls inside bpf_base_func_proto() function (and other similar ones) to determine eligibility of a given BPF helper for a given program, use previously recorded BPF token during BPF_PROG_LOAD command handling to inform the decision. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Petr Pavlu
|
c0591b1ccc |
tracing: Fix a possible race when disabling buffered events
Function trace_buffered_event_disable() is responsible for freeing pages
backing buffered events and this process can run concurrently with
trace_event_buffer_lock_reserve().
The following race is currently possible:
* Function trace_buffered_event_disable() is called on CPU 0. It
increments trace_buffered_event_cnt on each CPU and waits via
synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is finished, function
trace_buffered_event_disable() has the exclusive access to
trace_buffered_event. All counters trace_buffered_event_cnt are at 1
and all pointers trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
trace_event_buffer_lock_reserve(). The function calls
preempt_disable_notrace() and only now enters an RCU read-side
critical section. The function proceeds and reads a still valid
pointer from trace_buffered_event[CPU1] into the local variable
"entry". However, it doesn't yet read trace_buffered_event_cnt[CPU1]
which happens later.
* Function trace_buffered_event_disable() continues. It frees
trace_buffered_event[CPU1] and decrements
trace_buffered_event_cnt[CPU1] back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
believe that it can use the "entry" that it already obtained but the
pointer is now invalid and any access results in a use-after-free.
Fix the problem by making a second synchronize_rcu() call after all
trace_buffered_event values are set to NULL. This waits on all potential
users in trace_event_buffer_lock_reserve() that still read a previous
pointer from trace_buffered_event.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-4-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Fixes:
|
||
Petr Pavlu
|
34209fe83e |
tracing: Fix a warning when allocating buffered events fails
Function trace_buffered_event_disable() produces an unexpected warning
when the previous call to trace_buffered_event_enable() fails to
allocate pages for buffered events.
The situation can occur as follows:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
trace_buffered_event_enable() is called. The function increments
trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
is called for cleanup.
* Function trace_buffered_event_disable() decrements
trace_buffered_event_ref back to 0, recognizes that it was the last
use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
The caller of trace_buffered_event_enable() has no information that
the function actually failed.
* Some time later, the soft mode is disabled for the same event.
Function trace_buffered_event_disable() is called. It warns on
"WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
Buffered events are just an optimization and can handle failures. Make
trace_buffered_event_enable() exit on the first failure and left any
cleanup later to when trace_buffered_event_disable() is called.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-3-petr.pavlu@suse.com
Fixes:
|
||
Petr Pavlu
|
7fed14f7ac |
tracing: Fix incomplete locking when disabling buffered events
The following warning appears when using buffered events: [ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420 [...] [ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a [ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 [ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420 [ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff [ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202 [ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000 [ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400 [ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000 [ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008 [ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000 [ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0 [ 203.781862] Call Trace: [ 203.781870] <TASK> [ 203.851949] trace_event_buffer_commit+0x1ea/0x250 [ 203.851967] trace_event_raw_event_sys_enter+0x83/0xe0 [ 203.851983] syscall_trace_enter.isra.0+0x182/0x1a0 [ 203.851990] do_syscall_64+0x3a/0xe0 [ 203.852075] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 203.852090] RIP: 0033:0x7f4cd870fa77 [ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48 [ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089 [ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77 [ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0 [ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0 [ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40 [ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0 [ 204.049256] </TASK> For instance, it can be triggered by running these two commands in parallel: $ while true; do echo hist:key=id.syscall:val=hitcount > \ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger; done $ stress-ng --sysinfo $(nproc) The warning indicates that the current ring_buffer_per_cpu is not in the committing state. It happens because the active ring_buffer_event doesn't actually come from the ring_buffer_per_cpu but is allocated from trace_buffered_event. The bug is in function trace_buffered_event_disable() where the following normally happens: * The code invokes disable_trace_buffered_event() via smp_call_function_many() and follows it by synchronize_rcu(). This increments the per-CPU variable trace_buffered_event_cnt on each target CPU and grants trace_buffered_event_disable() the exclusive access to the per-CPU variable trace_buffered_event. * Maintenance is performed on trace_buffered_event, all per-CPU event buffers get freed. * The code invokes enable_trace_buffered_event() via smp_call_function_many(). This decrements trace_buffered_event_cnt and releases the access to trace_buffered_event. A problem is that smp_call_function_many() runs a given function on all target CPUs except on the current one. The following can then occur: * Task X executing trace_buffered_event_disable() runs on CPU 0. * The control reaches synchronize_rcu() and the task gets rescheduled on another CPU 1. * The RCU synchronization finishes. At this point, trace_buffered_event_disable() has the exclusive access to all trace_buffered_event variables except trace_buffered_event[CPU0] because trace_buffered_event_cnt[CPU0] is never incremented and if the buffer is currently unused, remains set to 0. * A different task Y is scheduled on CPU 0 and hits a trace event. The code in trace_event_buffer_lock_reserve() sees that trace_buffered_event_cnt[CPU0] is set to 0 and decides the use the buffer provided by trace_buffered_event[CPU0]. * Task X continues its execution in trace_buffered_event_disable(). The code incorrectly frees the event buffer pointed by trace_buffered_event[CPU0] and resets the variable to NULL. * Task Y writes event data to the now freed buffer and later detects the created inconsistency. The issue is observable since commit |
||
Steven Rostedt (Google)
|
b538bf7d0e |
tracing: Disable snapshot buffer when stopping instance tracers
It use to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). When stopping a tracer in an
instance would not disable the snapshot buffer. This could have some
unintended consequences if the irqsoff tracer is enabled.
Consolidate the tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions will
just call their respective tracing_start/stop_tr() with the global_array
passed in.
Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
Steven Rostedt (Google)
|
d78ab79270 |
tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects to the
running tracer. For instance, there's a race with irqsoff tracer that
swaps individual per cpu buffers between the main buffer and the snapshot
buffer. The resize operation modifies the main buffer and then the
snapshot buffer. If a swap happens in between those two operations it will
break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
Steven Rostedt (Google)
|
7be76461f3 |
tracing: Always update snapshot buffer size
It use to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). The update of the ring buffer
size would check if the instance was the top level and if so, it would
also update the snapshot buffer as it needs to be the same as the main
buffer.
Now that lower level instances also has a snapshot buffer, they too need
to update their snapshot buffer sizes when the main buffer is changed,
otherwise the following can be triggered:
# cd /sys/kernel/tracing
# echo 1500 > buffer_size_kb
# mkdir instances/foo
# echo irqsoff > instances/foo/current_tracer
# echo 1000 > instances/foo/buffer_size_kb
Produces:
WARNING: CPU: 2 PID: 856 at kernel/trace/trace.c:1938 update_max_tr_single.part.0+0x27d/0x320
Which is:
ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->array_buffer.buffer, cpu);
if (ret == -EBUSY) {
[..]
}
WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY); <== here
That's because ring_buffer_swap_cpu() has:
int ret = -EINVAL;
[..]
/* At least make sure the two buffers are somewhat the same */
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
[..]
out:
return ret;
}
Instead, update all instances' snapshot buffer sizes when their main
buffer size is updated.
Link: https://lkml.kernel.org/r/20231205220010.454662151@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
Jens Axboe
|
9fd7874c0e
|
iov_iter: replace import_single_range() with import_ubuf()
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Jens Axboe
|
6ac805d138
|
iov_iter: remove unused 'iov' argument from import_single_range()
It is entirely unused, just get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Song Liu
|
ac9c05e0e4 |
bpf: Add kfunc bpf_get_file_xattr
It is common practice for security solutions to store tags/labels in xattrs. To implement similar functionalities in BPF LSM, add new kfunc bpf_get_file_xattr(). The first use case of bpf_get_file_xattr() is to implement file verifications with asymmetric keys. Specificially, security applications could use fsverity for file hashes and use xattr to store file signatures. (kfunc for fsverity hash will be added in a separate commit.) Currently, only xattrs with "user." prefix can be read with kfunc bpf_get_file_xattr(). As use cases evolve, we may add a dedicated prefix for bpf_get_file_xattr(). To avoid recursion, bpf_get_file_xattr can be only called from LSM hooks. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20231129234417.856536-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Kees Cook
|
8a3750ecf8 |
tracing/uprobe: Replace strlcpy() with strscpy()
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated[1]. Additionally, it returns the size of the source string, not the resulting size of the destination string. In an effort to remove strlcpy() completely[2], replace strlcpy() here with strscpy(). The negative return value is already handled by this code so no new handling is needed here. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1] Link: https://github.com/KSPP/linux/issues/89 [2] Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: linux-trace-kernel@vger.kernel.org Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20231130205607.work.463-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> |
||
Masami Hiramatsu (Google)
|
a1461f1fd6 |
rethook: Use __rcu pointer for rethook::handler
Since the rethook::handler is an RCU-maganged pointer so that it will
notice readers the rethook is stopped (unregistered) or not, it should
be an __rcu pointer and use appropriate functions to be accessed. This
will use appropriate memory barrier when accessing it. OTOH,
rethook::data is never changed, so we don't need to check it in
get_kretprobe().
NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.
Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/
Fixes:
|
||
Jiri Olsa
|
e56fdbfb06 |
bpf: Add link_info support for uprobe multi link
Adding support to get uprobe_link details through bpf_link_info interface. Adding new struct uprobe_multi to struct bpf_link_info to carry the uprobe_multi link details. The uprobe_multi.count is passed from user space to denote size of array fields (offsets/ref_ctr_offsets/cookies). The actual array size is stored back to uprobe_multi.count (allowing user to find out the actual array size) and array fields are populated up to the user passed size. All the non-array fields (path/count/flags/pid) are always set. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org |
||
Jiri Olsa
|
4930b7f53a |
bpf: Store ref_ctr_offsets values in bpf_uprobe array
We will need to return ref_ctr_offsets values through link_info interface in following change, so we need to keep them around. Storing ref_ctr_offsets values directly into bpf_uprobe array. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231125193130.834322-3-jolsa@kernel.org |
||
Jakub Kicinski
|
53475287da |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZV0kjgAKCRDbK58LschI gy0EAP9XwncW2OhO72DpITluFzvWPgB0N97OANKBXjzKJrRAlQD/aUe9nlvBQuad WsbMKLeC4wvI2X/4PEIR4ukbuZ3ypAA= =LMVg -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-11-21 We've added 85 non-merge commits during the last 12 day(s) which contain a total of 63 files changed, 4464 insertions(+), 1484 deletions(-). The main changes are: 1) Huge batch of verifier changes to improve BPF register bounds logic and range support along with a large test suite, and verifier log improvements, all from Andrii Nakryiko. 2) Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id, from Yafang Shao. 3) Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext, from Dave Marchevsky. 4) Fix bpf_get_task_stack() helper to add the correct crosstask check for the get_perf_callchain(), from Jordan Rome. 5) Fix BPF task_iter internals where lockless usage of next_thread() was wrong. The rework also simplifies the code, from Oleg Nesterov. 6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another fix for certain BPF UAPI structs to fix verifier failures seen in bpf_dynptr usage, from Yonghong Song. 7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF memory allocator not being able to allocate per-CPU pointer successfully, from Hou Tao. 8) Add prep work around dynptr and string handling for kfuncs which is later going to be used by file verification via BPF LSM and fsverity, from Song Liu. 9) Improve BPF selftests to update multiple prog_tests to use ASSERT_* macros, from Yuran Pereira. 10) Optimize LPM trie lookup to check prefixlen before walking the trie, from Florian Lehner. 11) Consolidate virtio/9p configs from BPF selftests in config.vm file given they are needed consistently across archs, from Manu Bretelle. 12) Small BPF verifier refactor to remove register_is_const(), from Shung-Hsi Yu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits) selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca selftests/bpf: reduce verboseness of reg_bounds selftest logs bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos) bpf: bpf_iter_task_next: use __next_thread() rather than next_thread() bpf: task_group_seq_get_next: use __next_thread() rather than next_thread() bpf: emit frameno for PTR_TO_STACK regs if it differs from current one bpf: smarter verifier log number printing logic bpf: omit default off=0 and imm=0 in register state log bpf: emit map name in register state if applicable and available bpf: print spilled register state in stack slot bpf: extract register state printing bpf: move verifier state printing code to kernel/bpf/log.c bpf: move verbose_linfo() into kernel/bpf/log.c bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS bpf: Remove test for MOVSX32 with offset=32 selftests/bpf: add iter test requiring range x range logic veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag ... ==================== Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Yujie Liu
|
f032c53bea |
tracing/kprobes: Fix the order of argument descriptions
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.
int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
const char *name, const char *loc, ...)
Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/
Fixes:
|
||
Masami Hiramatsu (Google)
|
ce51e6153f |
tracing: fprobe-event: Fix to check tracepoint event and return
Fix to check the tracepoint event is not valid with $retval. The commit |
||
Song Liu
|
74523c06ae |
bpf: Add __bpf_dynptr_data* for in kernel use
Different types of bpf dynptr have different internal data storage. Specifically, SKB and XDP type of dynptr may have non-continuous data. Therefore, it is not always safe to directly access dynptr->data. Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to dynptr->data. Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of dynptr->data. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
89cdf9d556 |
Including fixes from netfilter and bpf.
Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: fix usec timestamps with TCP fastopen - tcp_sigpool: fix some off by one bugs - tcp: fix possible out-of-bounds reads in tcp_hash_fail() - tcp: fix SYN option room calculation for TCP-AO - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVNRnAACgkQMUZtbf5S IrsYaA/+IUoYi96/oLtvvrET6HIbXeMaLKef0UlytEicQKy8h5EWlhcTZPhQEY0g dtaKOemQsO0dQTma4eQBiBDHeCeSkitgD9p7fh0i+//QFYWSFqHrBiF2mlToc/ZQ T1p4BlVL7D2Xsr1Lki93zk+EhFGEy2KroYgrWbZc9TWE5ap9PtSVF9eqeHAVCmZ7 ocre/eo4pqUM9rAHIAyhoL+0xtVQ59dBevbJC0qYcmflhafr82Gtdveo6pBBKuYm GhwbRrAXER3Neav9c6NHqat4zsMwGpC27SiN9dYWm6dlkeS9U9t2PUu71OkJGVfw VaSE+utkC/WmzGbuiUIjqQLBrRe372ItHCr78BfSRMshS+RBTHtoK7njeH8Iv67E RsMeCyVNj9dtGlOQG5JAv8IoCQ1WbMw9B36Yzw3ip/MmDX/ntXz7Dcr4ZMZ6VURS CHhHFZPnmMykMXkT6SIlxeAg2r8ELtESzkvLimdTVFPAlk3cPkibKJbh3F/tEqXS PDb3y0uoEgRQBAsWXXx9FQEvv9rTL6YrzbMhmJBIIEoNxppQYQ7FZBJX9utAVp5B 1GdyqhR6IRTaKb9cMRj/K1xPwm2KgCw9xj9pjKdAA7QUMslXbFp8blv1rIkFGshg hiNXmPcI8wo0j+0lZYktEcIERL5y6c8BgK2NnPU6RULua96tuQ4= =k6Wk -----END PGP SIGNATURE----- Merge tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: - fix usec timestamps with TCP fastopen - fix possible out-of-bounds reads in tcp_hash_fail() - fix SYN option room calculation for TCP-AO - tcp_sigpool: fix some off by one bugs - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come" * tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: ti: icss-iep: fix setting counter value ptp: fix corrupted list in ptp_open ptp: ptp_read should not release queue net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP net: kcm: fill in MODULE_DESCRIPTION() net/sched: act_ct: Always fill offloading tuple iifidx netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses netfilter: xt_recent: fix (increase) ipv6 literal buffer length ipvs: add missing module descriptions netfilter: nf_tables: remove catchall element in GC sync path netfilter: add missing module descriptions drivers/net/ppp: use standard array-copy-function net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt() r8169: respect userspace disabling IFF_MULTICAST selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg net: phylink: initialize carrier state at creation test/vsock: add dobule bind connect test test/vsock: refactor vsock_accept ... |
||
Linus Torvalds
|
31e5f934ff |
Tracing updates for v6.7:
- Remove eventfs_file descriptor This is the biggest change, and the second part of making eventfs create its files dynamically. In 6.6 the first part was added, and that maintained a one to one mapping between eventfs meta descriptors and the directories and file inodes and dentries that were dynamically created. The directories were represented by a eventfs_inode and the files were represented by a eventfs_file. In v6.7 the eventfs_file is removed. As all events have the same directory make up (sched_switch has an "enable", "id", "format", etc files), the handing of what files are underneath each leaf eventfs directory is moved back to the tracing subsystem via a callback. When a event is added to the eventfs, it registers an array of evenfs_entry's. These hold the names of the files and the callbacks to call when the file is referenced. The callback gets the name so that the same callback may be used by multiple files. The callback then supplies the filesystem_operations structure needed to create this file. This has brought the memory footprint of creating multiple eventfs instances down by 2 megs each! - User events now has persistent events that are not associated to a single processes. These are privileged events that hang around even if no process is attached to them. - Clean up of seq_buf. There's talk about using seq_buf more to replace strscpy() and friends. But this also requires some minor modifications of seq_buf to be able to do this. - Expand instance ring buffers individually Currently if boot up creates an instance, and a trace event is enabled on that instance, the ring buffer for that instance and the top level ring buffer are expanded (1.4 MB per CPU). This wastes memory as this happens when nothing is using the top level instance. - Other minor clean ups and fixes. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZUMrBBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quzVAQCed/kPM7X9j2QZamJVDruMf2CmVxpu /TOvKvSKV584GgEAxLntf5VKx1Q98bc68y3Zkg+OCi8jSgORos1ROmURhws= =iIgb -----END PGP SIGNATURE----- Merge tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Remove eventfs_file descriptor This is the biggest change, and the second part of making eventfs create its files dynamically. In 6.6 the first part was added, and that maintained a one to one mapping between eventfs meta descriptors and the directories and file inodes and dentries that were dynamically created. The directories were represented by a eventfs_inode and the files were represented by a eventfs_file. In v6.7 the eventfs_file is removed. As all events have the same directory make up (sched_switch has an "enable", "id", "format", etc files), the handing of what files are underneath each leaf eventfs directory is moved back to the tracing subsystem via a callback. When an event is added to the eventfs, it registers an array of evenfs_entry's. These hold the names of the files and the callbacks to call when the file is referenced. The callback gets the name so that the same callback may be used by multiple files. The callback then supplies the filesystem_operations structure needed to create this file. This has brought the memory footprint of creating multiple eventfs instances down by 2 megs each! - User events now has persistent events that are not associated to a single processes. These are privileged events that hang around even if no process is attached to them - Clean up of seq_buf There's talk about using seq_buf more to replace strscpy() and friends. But this also requires some minor modifications of seq_buf to be able to do this - Expand instance ring buffers individually Currently if boot up creates an instance, and a trace event is enabled on that instance, the ring buffer for that instance and the top level ring buffer are expanded (1.4 MB per CPU). This wastes memory as this happens when nothing is using the top level instance - Other minor clean ups and fixes * tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) seq_buf: Export seq_buf_puts() seq_buf: Export seq_buf_putc() eventfs: Use simple_recursive_removal() to clean up dentries eventfs: Remove special processing of dput() of events directory eventfs: Delete eventfs_inode when the last dentry is freed eventfs: Hold eventfs_mutex when calling callback functions eventfs: Save ownership and mode eventfs: Test for ei->is_freed when accessing ei->dentry eventfs: Have a free_ei() that just frees the eventfs_inode eventfs: Remove "is_freed" union with rcu head eventfs: Fix kerneldoc of eventfs_remove_rec() tracing: Have the user copy of synthetic event address use correct context eventfs: Remove extra dget() in eventfs_create_events_dir() tracing: Have trace_event_file have ref counters seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str() eventfs: Fix typo in eventfs_inode union comment eventfs: Fix WARN_ON() in create_file_dentry() powerpc: Remove initialisation of readpos tracing/histograms: Simplify last_cmd_set() seq_buf: fix a misleading comment ... |
||
Dave Marchevsky
|
391145ba2a |
bpf: Add __bpf_kfunc_{start,end}_defs macros
BPF kfuncs are meant to be called from BPF programs. Accordingly, most kfuncs are not called from anywhere in the kernel, which the -Wmissing-prototypes warning is unhappy about. We've peppered __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are defined in the codebase to suppress this warning. This patch adds two macros meant to bound one or many kfunc definitions. All existing kfunc definitions which use these __diag calls to suppress -Wmissing-prototypes are migrated to use the newly-introduced macros. A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier version of this patch [0] and another recent mailing list thread [1]. In the future we might need to ignore different warnings or do other kfunc-specific things. This change will make it easier to make such modifications for all kfunc defs. [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Steven Rostedt (Google)
|
4f7969bcd6 |
tracing: Have the user copy of synthetic event address use correct context
A synthetic event is created by the synthetic event interface that can
read both user or kernel address memory. In reality, it reads any
arbitrary memory location from within the kernel. If the address space is
in USER (where CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE is set) then
it uses strncpy_from_user_nofault() to copy strings otherwise it uses
strncpy_from_kernel_nofault().
But since both functions use the same variable there's no annotation to
what that variable is (ie. __user). This makes sparse complain.
Quiet sparse by typecasting the strncpy_from_user_nofault() variable to
a __user pointer.
Link: https://lore.kernel.org/linux-trace-kernel/20231031151033.73c42e23@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes:
|
||
Steven Rostedt (Google)
|
bb32500fb9 |
tracing: Have trace_event_file have ref counters
The following can crash the kernel:
# cd /sys/kernel/tracing
# echo 'p:sched schedule' > kprobe_events
# exec 5>>events/kprobes/sched/enable
# > kprobe_events
# exec 5>&-
The above commands:
1. Change directory to the tracefs directory
2. Create a kprobe event (doesn't matter what one)
3. Open bash file descriptor 5 on the enable file of the kprobe event
4. Delete the kprobe event (removes the files too)
5. Close the bash file descriptor 5
The above causes a crash!
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:tracing_release_file_tr+0xc/0x50
What happens here is that the kprobe event creates a trace_event_file
"file" descriptor that represents the file in tracefs to the event. It
maintains state of the event (is it enabled for the given instance?).
Opening the "enable" file gets a reference to the event "file" descriptor
via the open file descriptor. When the kprobe event is deleted, the file is
also deleted from the tracefs system which also frees the event "file"
descriptor.
But as the tracefs file is still opened by user space, it will not be
totally removed until the final dput() is called on it. But this is not
true with the event "file" descriptor that is already freed. If the user
does a write to or simply closes the file descriptor it will reference the
event "file" descriptor that was just freed, causing a use-after-free bug.
To solve this, add a ref count to the event "file" descriptor as well as a
new flag called "FREED". The "file" will not be freed until the last
reference is released. But the FREE flag will be set when the event is
removed to prevent any more modifications to that event from happening,
even if there's still a reference to the event "file" descriptor.
Link: https://lore.kernel.org/linux-trace-kernel/20231031000031.1e705592@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231031122453.7a48b923@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes:
|
||
Linus Torvalds
|
05bf73aa27 |
Probes updates for v6.7:
- cleanups: . kprobes: Fixes typo in kprobes samples. . tracing/eprobes: Remove 'break' after return. - kretprobe/fprobe performance improvements: . lib: Introduce new `objpool`, which is a high performance lockless object queue. This uses per-cpu ring array to allocate/release objects from the pre-allocated object pool. Since the index of ring array is a 32bit sequential counter, we can retry to push/pop the object pointer from the ring without lock (as seq-lock does). . lib: Add an objpool test module to test the functionality and evaluate the performance under some circumstances. . kprobes/fprobe: Improve kretprobe and rethook scalability performance with objpool. This improves both legacy kretprobe and fprobe exit handler (which is based on rethook) to be scalable on SMP systems. Even with 8-threads parallel test, it shows a great scalability improvement. . Remove unneeded freelist.h which is replaced by objpool. . objpool: Add maintainers entry for the objpool. . objpool: Fix to remove unused include header lines. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmVA54obHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8busoH/3mG/rJwVVJw70zTLlfs ko4U1wn16aImYQYYLXkZLlYsKr6Y2dzNkb5C4CEI2r47EZjTamHatGZ6MSwvAtPb u9oloHEbRbE6yM+EjrE1JAKT9FwC+21/yZCN2zACZKJRwCwQRzxGIXUwGTWtDNdE NySLBDyMoR6zZJsFy8YueFBAJxcZdWIPK6mQH2Y5awVQA4tV7tQEe92KFqUYWTd5 exbfBbcVG8MBWmrPqRI46Hxh0NWOnPCqFwGqX8Q7hE/yrQnTPzJ+2ZsbYFkGRk6A pM5wRCdwO5+OlcHEcEHBMQSGCmFgk6m1UMG8RvbCKyF3cwHbxzlelbjzHosKQvSh EKQ= =/vZK -----END PGP SIGNATURE----- Merge tag 'probes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: "Cleanups: - kprobes: Fixes typo in kprobes samples - tracing/eprobes: Remove 'break' after return kretprobe/fprobe performance improvements: - lib: Introduce new `objpool`, which is a high performance lockless object queue. This uses per-cpu ring array to allocate/release objects from the pre-allocated object pool. Since the index of ring array is a 32bit sequential counter, we can retry to push/pop the object pointer from the ring without lock (as seq-lock does) - lib: Add an objpool test module to test the functionality and evaluate the performance under some circumstances - kprobes/fprobe: Improve kretprobe and rethook scalability performance with objpool. This improves both legacy kretprobe and fprobe exit handler (which is based on rethook) to be scalable on SMP systems. Even with 8-threads parallel test, it shows a great scalability improvement - Remove unneeded freelist.h which is replaced by objpool - objpool: Add maintainers entry for the objpool - objpool: Fix to remove unused include header lines" * tag 'probes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: unused header files removed MAINTAINERS: objpool added kprobes: freelist.h removed kprobes: kretprobe scalability improvement lib: objpool test module added lib: objpool added: ring-array based lockless MPMC tracing/eprobe: drop unneeded breaks samples: kprobes: Fixes a typo |
||
Linus Torvalds
|
89ed67ef12 |
Networking changes for 6.7.
Core & protocols ---------------- - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF --- - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure. https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code ---------------------- - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API ---------- - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc ---- - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed ------- - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH 2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab 41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1 PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM= =46nS -----END PGP SIGNATURE----- Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ... |
||
Linus Torvalds
|
3b3f874cc1 |
vfs-6.7.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit
|
||
Kees Cook
|
dcc4e5728e |
seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
Solve two ergonomic issues with struct seq_buf; 1) Too much boilerplate is required to initialize: struct seq_buf s; char buf[32]; seq_buf_init(s, buf, sizeof(buf)); Instead, we can build this directly on the stack. Provide DECLARE_SEQ_BUF() macro to do this: DECLARE_SEQ_BUF(s, 32); 2) %NUL termination is fragile and requires 2 steps to get a valid C String (and is a layering violation exposing the "internals" of seq_buf): seq_buf_terminate(s); do_something(s->buffer); Instead, we can just return s->buffer directly after terminating it in the refactored seq_buf_terminate(), now known as seq_buf_str(): do_something(seq_buf_str(s)); Link: https://lore.kernel.org/linux-trace-kernel/20231027155634.make.260-kees@kernel.org Link: https://lore.kernel.org/linux-trace-kernel/20231026194033.it.702-kees@kernel.org/ Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Justin Stitt <justinstitt@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yun Zhou <yun.zhou@windriver.com> Cc: Jacob Keller <jacob.e.keller@intel.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Andrii Nakryiko
|
926fe783c8 |
tracing/kprobes: Fix symbol counting logic by looking at modules as well
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.
Fix this my calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.
Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/
Cc: Francis Laniel <flaniel@linux.microsoft.com>
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes:
|
||
Yujie Liu
|
e0f831836c |
tracing/kprobes: Fix the description of variable length arguments
Fix the following kernel-doc warnings:
kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start'
kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields'
Refer to the usage of variable length arguments elsewhere in the kernel
code, "@..." is the proper way to express it in the description.
Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/
Fixes:
|
||
Jakub Kicinski
|
ec4c20ca09 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c |
||
wuqiang.matt
|
4758560fa2 |
kprobes: unused header files removed
As kernel test robot reported, lib/test_objpool.c (trace:probes/for-next) has linux/version.h included, but version.h is not used at all. Then more unused headers are found in test_objpool.c and rethook.c, and all of them should be removed. Link: https://lore.kernel.org/all/20231023112245.6112-1-wuqiang.matt@bytedance.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310191512.vvypKU5Z-lkp@intel.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Christophe JAILLET
|
545db7e21e |
tracing/histograms: Simplify last_cmd_set()
Turn a kzalloc()+strcpy()+strncat() into an equivalent and less verbose kasprintf(). Link: https://lore.kernel.org/linux-trace-kernel/30b6fb04dadc10a03cc1ad08f5d8a93ef623a167.1697899346.git.christophe.jaillet@wanadoo.fr Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Mukesh ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Matthew Wilcox (Oracle)
|
d0ed46b603 |
tracing: Move readpos from seq_buf to trace_seq
To make seq_buf more lightweight as a string buf, move the readpos member from seq_buf to its container, trace_seq. That puts the responsibility of maintaining the readpos entirely in the tracing code. If some future users want to package up the readpos with a seq_buf, we can define a new struct then. Link: https://lore.kernel.org/linux-trace-kernel/20231020033545.2587554-2-willy@infradead.org Cc: Kees Cook <keescook@chromium.org> Cc: Justin Stitt <justinstitt@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Dan Carpenter
|
5264a2f4bb |
tracing: Fix a NULL vs IS_ERR() bug in event_subsystem_dir()
The eventfs_create_dir() function returns error pointers, it never returns
NULL. Update the check to reflect that.
Link: https://lore.kernel.org/linux-trace-kernel/ff641474-84e2-46a7-9d7a-62b251a1050c@moroto.mountain
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Francis Laniel
|
b022f0c7e4 |
tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols
When a kprobe is attached to a function that's name is not unique (is
static and shares the name with other functions in the kernel), the
kprobe is attached to the first function it finds. This is a bug as the
function that it is attaching to is not necessarily the one that the
user wants to attach to.
Instead of blindly picking a function to attach to what is ambiguous,
error with EADDRNOTAVAIL to let the user know that this function is not
unique, and that the user must use another unique function with an
address offset to get to the function they want to attach to.
Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/
Cc: stable@vger.kernel.org
Fixes:
|
||
Jakub Kicinski
|
041c3466f3 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. net/mac80211/key.c |
||
Amir Goldstein
|
08582d678f
|
fs: create helper file_user_path() for user displayed mapped file path
Overlayfs uses backing files with "fake" overlayfs f_path and "real" underlying f_inode, in order to use underlying inode aops for mapped files and to display the overlayfs path in /proc/<pid>/maps. In preparation for storing the overlayfs "fake" path instead of the underlying "real" path in struct backing_file, define a noop helper file_user_path() that returns f_path for now. Use the new helper in procfs and kernel logs whenever a path of a mapped file is displayed to users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
wuqiang.matt
|
4bbd934556 |
kprobes: kretprobe scalability improvement
kretprobe is using freelist to manage return-instances, but freelist, as LIFO queue based on singly linked list, scales badly and reduces the overall throughput of kretprobed routines, especially for high contention scenarios. Here's a typical throughput test of sys_prctl (counts in 10 seconds, measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl): OS: Debian 10 X86_64, Linux 6.5rc7 with freelist HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T 24T 24150045 29317964 15446741 12494489 18287272 17708768 32T 48T 64T 72T 96T 128T 16200682 13737658 11645677 11269858 10470118 9931051 This patch introduces objpool to replace freelist. objpool is a high performance queue, which can bring near-linear scalability to kretprobed routines. Tests of kretprobe throughput show the biggest ratio as 159x of original freelist. Here's the result: 1T 2T 4T 8T 16T native: 41186213 82336866 164250978 328662645 658810299 freelist: 24150045 29317964 15446741 12494489 18287272 objpool: 23926730 48010314 96125218 191782984 385091769 32T 48T 64T 96T 128T native: 1330338351 1969957941 2512291791 2615754135 2671040914 freelist: 16200682 13737658 11645677 10470118 9931051 objpool: 764481096 1147149781 1456220214 1502109662 1579015050 Testings on 96-core ARM64 output similarly, but with the biggest ratio up to 448x: OS: Debian 10 AARCH64, Linux 6.5rc7 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: . 30066096 63569843 126194076 257447289 505800181 freelist: 16152090 11064397 11124068 7215768 5663013 objpool: 13997541 28032100 55726624 110099926 221498787 24T 32T 48T 64T 96T native: 763305277 1015925192 1521075123 2033009392 3021013752 freelist: 5015810 4602893 3766792 3382478 2945292 objpool: 328192025 439439564 668534502 887401381 1319972072 Link: https://lore.kernel.org/all/20231017135654.82270-4-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Jakub Kicinski
|
a3c2dd9648 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3 xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU= =vuWk -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Masami Hiramatsu (Google)
|
700b2b4397 |
fprobe: Fix to ensure the number of active retprobes is not zero
The number of active retprobes can be zero but it is not acceptable,
so return EINVAL error if detected.
Link: https://lore.kernel.org/all/169750018550.186853.11198884812017796410.stgit@devnote2/
Reported-by: wuqiang.matt <wuqiang.matt@bytedance.com>
Closes: https://lore.kernel.org/all/20231016222103.cb9f426edc60220eabd8aa6a@kernel.org/
Fixes:
|
||
Artem Savkov
|
ba8ea72388 |
bpf: Change syscall_nr type to int in struct syscall_tp_t
linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support for lazy preemption")) that adds an extra member to struct trace_entry. This causes the offset of args field in struct trace_event_raw_sys_enter be different from the one in struct syscall_trace_enter: struct trace_event_raw_sys_enter { struct trace_entry ent; /* 0 12 */ /* XXX last struct has 3 bytes of padding */ /* XXX 4 bytes hole, try to pack */ long int id; /* 16 8 */ long unsigned int args[6]; /* 24 48 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ char __data[]; /* 72 0 */ /* size: 72, cachelines: 2, members: 4 */ /* sum members: 68, holes: 1, sum holes: 4 */ /* paddings: 1, sum paddings: 3 */ /* last cacheline: 8 bytes */ }; struct syscall_trace_enter { struct trace_entry ent; /* 0 12 */ /* XXX last struct has 3 bytes of padding */ int nr; /* 12 4 */ long unsigned int args[]; /* 16 0 */ /* size: 16, cachelines: 1, members: 3 */ /* paddings: 1, sum paddings: 3 */ /* last cacheline: 16 bytes */ }; This, in turn, causes perf_event_set_bpf_prog() fail while running bpf test_profiler testcase because max_ctx_offset is calculated based on the former struct, while off on the latter: 10488 if (is_tracepoint || is_syscall_tp) { 10489 int off = trace_event_get_offsets(event->tp_event); 10490 10491 if (prog->aux->max_ctx_offset > off) 10492 return -EACCES; 10493 } What bpf program is actually getting is a pointer to struct syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes the problem by aligning struct syscall_tp_t with struct syscall_trace_(enter|exit) and changing the tests to use these structs to dereference context. Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com |
||
Julia Lawall
|
f843249cb6 |
tracing/eprobe: drop unneeded breaks
Drop break after return. Link: https://lore.kernel.org/all/20230928104334.41215-1-Julia.Lawall@inria.fr/ Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Steven Rostedt (Google)
|
5ddd8baa48 |
tracing: Make system_callback() function static
The system_callback() function in trace_events.c is only used within that
file. The "static" annotation was missed.
Fixes:
|
||
Steven Rostedt (Google)
|
2819f23ac1 |
eventfs: Use eventfs_remove_events_dir()
The update to removing the eventfs_file changed the way the events top
level directory was handled. Instead of returning a dentry, it now returns
the eventfs_inode. In this changed, the removing of the events top level
directory is not much different than removing any of the other
directories. Because of this, the removal just called eventfs_remove_dir()
instead of eventfs_remove_events_dir().
Although eventfs_remove_dir() does the clean up, it misses out on the
dget() of the ei->dentry done in eventfs_create_events_dir(). It makes
more sense to match eventfs_create_events_dir() with a specific function
eventfs_remove_events_dir() and this specific function can then perform
the dput() to the dentry that had the dget() when it was created.
Fixes:
|
||
Steven Rostedt (Google)
|
5790b1fb3d |
eventfs: Remove eventfs_file and just use eventfs_inode
Instead of having a descriptor for every file represented in the eventfs directory, only have the directory itself represented. Change the API to send in a list of entries that represent all the files in the directory (but not other directories). The entry list contains a name and a callback function that will be used to create the files when they are accessed. struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, const struct eventfs_entry *entries, int size, void *data); is used for the top level eventfs directory, and returns an eventfs_inode that will be used by: struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent, const struct eventfs_entry *entries, int size, void *data); where both of the above take an array of struct eventfs_entry entries for every file that is in the directory. The entries are defined by: typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); struct eventfs_entry { const char *name; eventfs_callback callback; }; Where the name is the name of the file and the callback gets called when the file is being created. The callback passes in the name (in case the same callback is used for multiple files), a pointer to the mode, data and fops. The data will be pointing to the data that was passed in eventfs_create_dir() or eventfs_create_events_dir() but may be overridden to point to something else, as it will be used to point to the inode->i_private that is created. The information passed back from the callback is used to create the dentry/inode. If the callback fills the data and the file should be created, it must return a positive number. On zero or negative, the file is ignored. This logic may also be used as a prototype to convert entire pseudo file systems into just-in-time allocation. The "show_events_dentry" file has been updated to show the directories, and any files they have. With just the eventfs_file allocations: Before after deltas for meminfo (in kB): MemFree: -14360 MemAvailable: -14260 Buffers: 40 Cached: 24 Active: 44 Inactive: 48 Inactive(anon): 28 Active(file): 44 Inactive(file): 20 Dirty: -4 AnonPages: 28 Mapped: 4 KReclaimable: 132 Slab: 1604 SReclaimable: 132 SUnreclaim: 1472 Committed_AS: 12 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] ext4_inode_cache 27 [* 1184 = 31968 ] extent_status 102 [* 40 = 4080 ] tracefs_inode_cache 144 [* 656 = 94464 ] buffer_head 39 [* 104 = 4056 ] shmem_inode_cache 49 [* 800 = 39200 ] filp -53 [* 256 = -13568 ] dentry 251 [* 192 = 48192 ] lsm_file_cache 277 [* 32 = 8864 ] vm_area_struct -14 [* 184 = -2576 ] trace_event_file 1748 [* 88 = 153824 ] kmalloc-1k 35 [* 1024 = 35840 ] kmalloc-256 49 [* 256 = 12544 ] kmalloc-192 -28 [* 192 = -5376 ] kmalloc-128 -30 [* 128 = -3840 ] kmalloc-96 10581 [* 96 = 1015776 ] kmalloc-64 3056 [* 64 = 195584 ] kmalloc-32 1291 [* 32 = 41312 ] kmalloc-16 2310 [* 16 = 36960 ] kmalloc-8 9216 [* 8 = 73728 ] Free memory dropped by 14,360 kB Available memory dropped by 14,260 kB Total slab additions in size: 1,771,032 bytes With this change: Before after deltas for meminfo (in kB): MemFree: -12084 MemAvailable: -11976 Buffers: 32 Cached: 32 Active: 72 Inactive: 168 Inactive(anon): 176 Active(file): 72 Inactive(file): -8 Dirty: 24 AnonPages: 196 Mapped: 8 KReclaimable: 148 Slab: 836 SReclaimable: 148 SUnreclaim: 688 Committed_AS: 324 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] tracefs_inode_cache 144 [* 656 = 94464 ] shmem_inode_cache -23 [* 800 = -18400 ] filp -92 [* 256 = -23552 ] dentry 179 [* 192 = 34368 ] lsm_file_cache -3 [* 32 = -96 ] vm_area_struct -13 [* 184 = -2392 ] trace_event_file 1748 [* 88 = 153824 ] kmalloc-1k -49 [* 1024 = -50176 ] kmalloc-256 -27 [* 256 = -6912 ] kmalloc-128 1864 [* 128 = 238592 ] kmalloc-64 4685 [* 64 = 299840 ] kmalloc-32 -72 [* 32 = -2304 ] kmalloc-16 256 [* 16 = 4096 ] total = 721352 Free memory dropped by 12,084 kB Available memory dropped by 11,976 kB Total slab additions in size: 721,352 bytes That's over 2 MB in savings per instance for free and available memory, and over 1 MB in savings per instance of slab memory. Link: https://lore.kernel.org/linux-trace-kernel/20231003184059.4924468e@gandalf.local.home Link: https://lore.kernel.org/linux-trace-kernel/20231004165007.43d79161@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
5dbd04eddb |
tracing/user_events: Allow events to persist for perfmon_capable users
There are several scenarios that have come up where having a user_event persist even if the process that registered it exits. The main one is having a daemon create events on bootup that shouldn't get deleted if the daemon has to exit or reload. Another is within OpenTelemetry exporters, they wish to potentially check if a user_event exists on the system to determine if exporting the data out should occur. The user_event in this case must exist even in the absence of the owning process running (such as the above daemon case). Expose the previously internal flag USER_EVENT_REG_PERSIST to user processes. Upon register or delete of events with this flag, ensure the user is perfmon_capable to prevent random user processes with access to tracefs from creating events that persist after exit. Link: https://lkml.kernel.org/r/20230912180704.1284-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Uros Bizjak
|
bdf4fb6280 |
ring_buffer: Use try_cmpxchg instead of cmpxchg in rb_insert_pages
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in rb_insert_pages. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Link: https://lore.kernel.org/linux-trace-kernel/20230914163420.12923-1-ubizjak@gmail.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Zheng Yejian
|
a1f157c7a3 |
tracing: Expand all ring buffers individually
The ring buffer of global_trace is set to the minimum size in order to save memory on boot up and then it will be expand when some trace feature enabled. However currently operations under an instance can also cause global_trace ring buffer being expanded, and the expanded memory would be wasted if global_trace then not being used. See following case, we enable 'sched_switch' event in instance 'A', then ring buffer of global_trace is unexpectedly expanded to be 1410KB, also the '(expanded: 1408)' from 'buffer_size_kb' of instance is confusing. # cd /sys/kernel/tracing # mkdir instances/A # cat buffer_size_kb 7 (expanded: 1408) # cat instances/A/buffer_size_kb 1410 (expanded: 1408) # echo sched:sched_switch > instances/A/set_event # cat buffer_size_kb 1410 # cat instances/A/buffer_size_kb 1410 To fix it, we can: - Make 'ring_buffer_expanded' as a member of 'struct trace_array'; - Make 'ring_buffer_expanded' of instance is defaultly true, global_trace is defaultly false; - In order not to expose 'global_trace' outside of file 'kernel/trace/trace.c', introduce trace_set_ring_buffer_expanded() to set 'ring_buffer_expanded' as 'true'; - Pass the expected trace_array to tracing_update_buffers(). Link: https://lore.kernel.org/linux-trace-kernel/20230906091837.3998020-1-zhengyejian1@huawei.com Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Beau Belgrave
|
2de9ee9405 |
tracing/user_events: Align set_bit() address for all archs
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.
Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
Cc: stable@vger.kernel.org
Fixes:
|
||
Clément Léger
|
23cce5f254 |
tracing: relax trace_event_eval_update() execution with cond_resched()
When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
1e0cb399c7 |
ring-buffer: Update "shortest_full" in polling
It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit |
||
Jiri Olsa
|
dd8657894c |
bpf: Count missed stats in trace_call_bpf
Increase misses stats in case bpf array execution is skipped because of recursion check in trace_call_bpf. Adding bpf_prog_inc_misses_counters that increase misses counts for all bpf programs in bpf_prog_array. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-5-jolsa@kernel.org |
||
Jiri Olsa
|
3acf8ace68 |
bpf: Add missed value to kprobe perf link info
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org |
||
Jiri Olsa
|
e2b2cd592a |
bpf: Add missed value to kprobe_multi link info
Add missed value to kprobe_multi link info to hold the stats of missed kprobe_multi probe. The missed counter gets incremented when fprobe fails the recursion check or there's no rethook available for return probe. In either case the attached bpf program is not executed. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230920213145.1941596-3-jolsa@kernel.org |
||
Jiri Olsa
|
f915fcb385 |
bpf: Count stats for kprobe_multi programs
Adding support to gather missed stats for kprobe_multi programs due to bpf_prog_active protection. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-2-jolsa@kernel.org |
||
Linus Torvalds
|
5edc6bb321 |
Tracing fixes for 6.6-rc2:
- Fix the "bytes" output of the per_cpu stat file The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the accounting was not accurate. It is suppose to show how many used bytes are still in the ring buffer, but even when the ring buffer was empty it would still show there were bytes used. - Fix a bug in eventfs where reading a dynamic event directory (open) and then creating a dynamic event that goes into that diretory screws up the accounting. On close, the newly created event dentry will get a "dput" without ever having a "dget" done for it. The fix is to allocate an array on dir open to save what dentries were actually "dget" on, and what ones to "dput" on close. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQ9wihQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quz4AP4vSFohvmAcTzC+sKP7gMLUvEmqL76+ 1pixXrQOIP5BrQEApUW3VnjqYgjZJR2ne0N4MvvmYElm/ylBhDd4JRrD3g8= =X9wd -----END PGP SIGNATURE----- Merge tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the "bytes" output of the per_cpu stat file The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the accounting was not accurate. It is suppose to show how many used bytes are still in the ring buffer, but even when the ring buffer was empty it would still show there were bytes used. - Fix a bug in eventfs where reading a dynamic event directory (open) and then creating a dynamic event that goes into that diretory screws up the accounting. On close, the newly created event dentry will get a "dput" without ever having a "dget" done for it. The fix is to allocate an array on dir open to save what dentries were actually "dget" on, and what ones to "dput" on close. * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Remember what dentries were created on dir open ring-buffer: Fix bytes info in per_cpu buffer stats |
||
Zheng Yejian
|
45d99ea451 |
ring-buffer: Fix bytes info in per_cpu buffer stats
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
bytes in cpu buffer that have not been consumed. However, currently
after consuming data by reading file 'trace_pipe', the 'bytes' info
was not changed as expected.
# cat per_cpu/cpu0/stats
entries: 0
overrun: 0
commit overrun: 0
bytes: 568 <--- 'bytes' is problematical !!!
oldest event ts: 8651.371479
now ts: 8653.912224
dropped events: 0
read events: 8
The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
1. When stat 'read_bytes', account consumed event in rb_advance_reader();
2. When stat 'entries_bytes', exclude the discarded padding event which
is smaller than minimum size because it is invisible to reader. Then
use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for
page-based read/remove/overrun.
Also correct the comments of ring_buffer_bytes_cpu() in this patch.
Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes:
|
||
David S. Miller
|
1612cc4b14 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 450 insertions(+), 36 deletions(-). The main changes are: 1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao. 2) Check whether override is allowed in kprobe mult, from Jiri Olsa. 3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick. 4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann, Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor, Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Jiri Olsa
|
57eb5e1c5c |
bpf: Fix uprobe_multi get_pid_task error path
Dan reported Smatch static checker warning due to missing error value set in uprobe multi link's get_pid_task error path. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/c5ffa7c0-6b06-40d5-aca2-63833b5cd9af@moroto.mountain/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230915101420.1193800-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
99214f6778 |
Tracing fixes for 6.6:
- Add missing LOCKDOWN checks for eventfs callers When LOCKDOWN is active for tracing, it causes inconsistent state when some functions succeed and others fail. - Use dput() to free the top level eventfs descriptor There was a race between accesses and freeing it. - Fix a long standing bug that eventfs exposed due to changing timings by dynamically creating files. That is, If a event file is opened for an instance, there's nothing preventing the instance from being removed which will make accessing the files cause use-after-free bugs. - Fix a ring buffer race that happens when iterating over the ring buffer while writers are active. Check to make sure not to read the event meta data if it's beyond the end of the ring buffer sub buffer. - Fix the print trigger that disappeared because the test to create it was looking for the event dir field being filled, but now it has the "ef" field filled for the eventfs structure. - Remove the unused "dir" field from the event structure. - Fix the order of the trace_dynamic_info as it had it backwards for the offset and len fields for which one was for which endianess. - Fix NULL pointer dereference with eventfs_remove_rec() If an allocation fails in one of the eventfs_add_*() functions, the caller of it in event_subsystem_dir() or event_create_dir() assigns the result to the structure. But it's assigning the ERR_PTR and not NULL. This was passed to eventfs_remove_rec() which expects either a good pointer or a NULL, not ERR_PTR. The fix is to not assign the ERR_PTR to the structure, but to keep it NULL on error. - Fix list_for_each_rcu() to use list_for_each_srcu() in dcache_dir_open_wrapper(). One iteration of the code used RCU but because it had to call sleepable code, it had to be changed to use SRCU, but one of the iterations was missed. - Fix synthetic event print function to use "as_u64" instead of passing in a pointer to the union. To fix big/little endian issues, the u64 that represented several types was turned into a union to define the types properly. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQCvoBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qtgrAP9MiYiCMU+90oJ+61DFchbs3y7BNidP s3lLRDUMJ935NQD/SSAm54PqWb+YXMpD7m9+3781l6xqwfabBMXNaEl+FwA= =tlZu -----END PGP SIGNATURE----- Merge tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add missing LOCKDOWN checks for eventfs callers When LOCKDOWN is active for tracing, it causes inconsistent state when some functions succeed and others fail. - Use dput() to free the top level eventfs descriptor There was a race between accesses and freeing it. - Fix a long standing bug that eventfs exposed due to changing timings by dynamically creating files. That is, If a event file is opened for an instance, there's nothing preventing the instance from being removed which will make accessing the files cause use-after-free bugs. - Fix a ring buffer race that happens when iterating over the ring buffer while writers are active. Check to make sure not to read the event meta data if it's beyond the end of the ring buffer sub buffer. - Fix the print trigger that disappeared because the test to create it was looking for the event dir field being filled, but now it has the "ef" field filled for the eventfs structure. - Remove the unused "dir" field from the event structure. - Fix the order of the trace_dynamic_info as it had it backwards for the offset and len fields for which one was for which endianess. - Fix NULL pointer dereference with eventfs_remove_rec() If an allocation fails in one of the eventfs_add_*() functions, the caller of it in event_subsystem_dir() or event_create_dir() assigns the result to the structure. But it's assigning the ERR_PTR and not NULL. This was passed to eventfs_remove_rec() which expects either a good pointer or a NULL, not ERR_PTR. The fix is to not assign the ERR_PTR to the structure, but to keep it NULL on error. - Fix list_for_each_rcu() to use list_for_each_srcu() in dcache_dir_open_wrapper(). One iteration of the code used RCU but because it had to call sleepable code, it had to be changed to use SRCU, but one of the iterations was missed. - Fix synthetic event print function to use "as_u64" instead of passing in a pointer to the union. To fix big/little endian issues, the u64 that represented several types was turned into a union to define the types properly. * tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec() tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper() tracing/synthetic: Print out u64 values properly tracing/synthetic: Fix order of struct trace_dynamic_info selftests/ftrace: Fix dependencies for some of the synthetic event tests tracing: Remove unused trace_event_file dir field tracing: Use the new eventfs descriptor for print trigger ring-buffer: Do not attempt to read past "commit" tracefs/eventfs: Free top level files on removal ring-buffer: Avoid softlockup in ring_buffer_resize() tracing: Have event inject files inc the trace array ref count tracing: Have option files inc the trace array ref count tracing: Have current_trace inc the trace array ref count tracing: Have tracing_max_latency inc the trace array ref count tracing: Increase trace array ref count on enable and filter files tracefs/eventfs: Use dput to free the toplevel events directory tracefs/eventfs: Add missing lockdown checks tracefs: Add missing lockdown check to tracefs_create_dir() |
||
Jinjie Ruan
|
c8414dab16 |
eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()
Inject fault while probing btrfs.ko, if kstrdup() fails in
eventfs_prepare_ef() in eventfs_add_dir(), it will return ERR_PTR
to assign file->ef. But the eventfs_remove() check NULL in
trace_module_remove_events(), which causes the below NULL
pointer dereference.
As both Masami and Steven suggest, allocater side should handle the
error carefully and remove it, so fix the places where it failed.
Could not create tracefs 'raid56_write' directory
Btrfs loaded, zoned=no, fsverity=no
Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102544000
[000000000000001c] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: btrfs]
CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40
Hardware name: linux,dummy-virt (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : eventfs_remove_rec+0x24/0xc0
lr : eventfs_remove+0x68/0x1d8
sp : ffff800082d63b60
x29: ffff800082d63b60 x28: ffffb84b80ddd00c x27: ffffb84b3054ba40
x26: 0000000000000002 x25: ffff800082d63bf8 x24: ffffb84b8398e440
x23: ffffb84b82af3000 x22: dead000000000100 x21: dead000000000122
x20: ffff800082d63bf8 x19: fffffffffffffff4 x18: ffffb84b82508820
x17: 0000000000000000 x16: 0000000000000000 x15: 000083bc876a3166
x14: 000000000000006d x13: 000000000000006d x12: 0000000000000000
x11: 0000000000000001 x10: 00000000000017e0 x9 : 0000000000000001
x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffb84b84289804
x5 : 0000000000000000 x4 : 9696969696969697 x3 : ffff33a5b7601f38
x2 : 0000000000000000 x1 : ffff800082d63bf8 x0 : fffffffffffffff4
Call trace:
eventfs_remove_rec+0x24/0xc0
eventfs_remove+0x68/0x1d8
remove_event_file_dir+0x88/0x100
event_remove+0x140/0x15c
trace_module_notify+0x1fc/0x230
notifier_call_chain+0x98/0x17c
blocking_notifier_call_chain+0x4c/0x74
__arm64_sys_delete_module+0x1a4/0x298
invoke_syscall+0x44/0x100
el0_svc_common.constprop.1+0x68/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x3c/0xc4
el0t_64_sync_handler+0xa0/0xc4
el0t_64_sync+0x174/0x178
Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: 0x384b00c00000 from 0xffff800080000000
PHYS_OFFSET: 0xffffcc5b80000000
CPU features: 0x88000203,3c020000,1000421b
Memory Limit: none
Rebooting in 1 seconds..
Link: https://lore.kernel.org/linux-trace-kernel/20230912134752.1838524-1-ruanjinjie@huawei.com
Link: https://lore.kernel.org/all/20230912025808.668187-1-ruanjinjie@huawei.com/
Link: https://lore.kernel.org/all/20230911052818.1020547-1-ruanjinjie@huawei.com/
Link: https://lore.kernel.org/all/20230909072817.182846-1-ruanjinjie@huawei.com/
Link: https://lore.kernel.org/all/20230908074816.3724716-1-ruanjinjie@huawei.com/
Cc: Ajay Kaher <akaher@vmware.com>
Fixes:
|
||
Tero Kristo
|
62663b8496 |
tracing/synthetic: Print out u64 values properly
The synth traces incorrectly print pointer to the synthetic event values
instead of the actual value when using u64 type. Fix by addressing the
contents of the union properly.
Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com
Fixes:
|
||
Steven Rostedt (Google)
|
6fdac58c56 |
tracing: Remove unused trace_event_file dir field
Now that eventfs structure is used to create the events directory via the eventfs dynamically allocate code, the "dir" field of the trace_event_file structure is no longer used. Remove it. Link: https://lkml.kernel.org/r/20230908022001.580400115@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
1ef26d8b2c |
tracing: Use the new eventfs descriptor for print trigger
The check to create the print event "trigger" was using the obsolete "dir"
value of the trace_event_file to determine if it should create the trigger
or not. But that value will now be NULL because it uses the event file
descriptor.
Change it to test the "ef" field of the trace_event_file structure so that
the trace_marker "trigger" file appears again.
Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Fixes:
|
||
Steven Rostedt (Google)
|
95a404bd60 |
ring-buffer: Do not attempt to read past "commit"
When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There's barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks to detect the corruption by the writer to reads needs to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that code, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Jiri Olsa
|
41bc46c12a |
bpf: Add override check to kprobe multi link attach
Currently the multi_kprobe link attach does not check error
injection list for programs with bpf_override_return helper
and allows them to attach anywhere. Adding the missing check.
Fixes:
|
||
Zheng Yejian
|
f6bd2c9248 |
ring-buffer: Avoid softlockup in ring_buffer_resize()
When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be allocated, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Steven Rostedt (Google)
|
e5c624f027 |
tracing: Have event inject files inc the trace array ref count
The event inject files add events for a specific trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.
Up the ref count of the trace_array when a event inject file is opened.
Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes:
|
||
Steven Rostedt (Google)
|
7e2cfbd2d3 |
tracing: Have option files inc the trace array ref count
The option files update the options for a given trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.
Up the ref count of the trace_array when an option file is opened.
Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes:
|
||
Steven Rostedt (Google)
|
9b37febc57 |
tracing: Have current_trace inc the trace array ref count
The current_trace updates the trace array tracer. For an instance, if the
file is opened and the instance is deleted, reading or writing to the file
will cause a use after free.
Up the ref count of the trace array when current_trace is opened.
Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes:
|
||
Steven Rostedt (Google)
|
7d660c9b2b |
tracing: Have tracing_max_latency inc the trace array ref count
The tracing_max_latency file points to the trace_array max_latency field.
For an instance, if the file is opened and the instance is deleted,
reading or writing to the file will cause a use after free.
Up the ref count of the trace_array when tracing_max_latency is opened.
Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes:
|
||
Steven Rostedt (Google)
|
f5ca233e2e |
tracing: Increase trace array ref count on enable and filter files
When the trace event enable and filter files are opened, increment the
trace array ref counter, otherwise they can be accessed when the trace
array is being deleted. The ref counter keeps the trace array from being
deleted while those files are opened.
Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
Linus Torvalds
|
b70100f2e6 |
Probes updates for v6.6:
- kprobes: use struct_size() for variable size kretprobe_instance data structure. - eprobe: Simplify trace_eprobe list iteration. - probe events: Data structure field access support on BTF argument. . Update BTF argument support on the functions in the kernel loadable modules (only loaded modules are supported). . Move generic BTF access function (search function prototype and get function parameters) to a separated file. . Add a function to search a member of data structure in BTF. . Support accessing BTF data structure member from probe args by C-like arrow('->') and dot('.') operators. e.g. 't sched_switch next=next->pid vruntime=next->se.vruntime' . Support accessing BTF data structure member from $retval. e.g. 'f getname_flags%return +0($retval->name):string' . Add string type checking if BTF type info is available. This will reject if user specify ":string" type for non "char pointer" type. . Automatically assume the fprobe event as a function return event if $retval is used. - selftests/ftrace: Add BTF data field access test cases. - Documentation: Update fprobe event example with BTF data field. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTycQkbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bqS8H/jeR1JhOzIXOvTw7XCFm MrSY/SKi8tQfV6lau2UmoYdbYvYjpqL34XLOQPNf2/lrcL2M9aNYXk9fbhlW8enx vkMyKQ0E5anixkF4vsTbEl9DaprxbpsPVACmZ/7VjQk2JuXIdyaNk8hno9LgIcEq udztb0o2HmDFqAXfRi0LvlSTAIwvXZ+usmEvYpaq1g2WwrCe7NHEYl42vMpj+h4H 9l4t5rA9JyPPX4yQUjtKGW5eRVTwDTm/Gn6DRzYfYzkkiBZv27qfovzBOt672LgG hyot+u7XeKvZx3jjnF7+mRWoH/m0dqyhyi/nPhpIE09VhgwclrbGAcDuR1x6sp01 PHY= =hBDN -----END PGP SIGNATURE----- Merge tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - kprobes: use struct_size() for variable size kretprobe_instance data structure. - eprobe: Simplify trace_eprobe list iteration. - probe events: Data structure field access support on BTF argument. - Update BTF argument support on the functions in the kernel loadable modules (only loaded modules are supported). - Move generic BTF access function (search function prototype and get function parameters) to a separated file. - Add a function to search a member of data structure in BTF. - Support accessing BTF data structure member from probe args by C-like arrow('->') and dot('.') operators. e.g. 't sched_switch next=next->pid vruntime=next->se.vruntime' - Support accessing BTF data structure member from $retval. e.g. 'f getname_flags%return +0($retval->name):string' - Add string type checking if BTF type info is available. This will reject if user specify ":string" type for non "char pointer" type. - Automatically assume the fprobe event as a function return event if $retval is used. - selftests/ftrace: Add BTF data field access test cases. - Documentation: Update fprobe event example with BTF data field. * tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: Documentation: tracing: Update fprobe event example with BTF field selftests/ftrace: Add BTF fields access testcases tracing/fprobe-event: Assume fprobe is a return event by $retval tracing/probes: Add string type check with BTF tracing/probes: Support BTF field access from $retval tracing/probes: Support BTF based data structure field access tracing/probes: Add a function to search a member of a struct/union tracing/probes: Move finding func-proto API and getting func-param API to trace_btf tracing/probes: Support BTF argument on module functions tracing/eprobe: Iterate trace_eprobe directly kernel: kprobes: Use struct_size() |
||
Valentin Schneider
|
cbb557ba92 |
tracing/filters: Fix coding style issues
Recent commits have introduced some coding style issues, fix those up. Link: https://lkml.kernel.org/r/20230901151039.125186-5-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
2900bcbee3 |
tracing/filters: Change parse_pred() cpulist ternary into an if block
Review comments noted that an if block would be clearer than a ternary, so swap it out. No change in behaviour intended Link: https://lkml.kernel.org/r/20230901151039.125186-4-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
1caf7adb9e |
tracing/filters: Fix double-free of struct filter_pred.mask
When a cpulist filter is found to contain a single CPU, that CPU is saved as a scalar and the backing cpumask storage is freed. Also NULL the mask to avoid a double-free once we get down to free_predicate(). Link: https://lkml.kernel.org/r/20230901151039.125186-3-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
9af4058493 |
tracing/filters: Fix error-handling of cpulist parsing buffer
parse_pred() allocates a string buffer to parse the user-provided cpulist, but doesn't check the allocation result nor does it free the buffer once it is no longer needed. Add an allocation check, and free the buffer as soon as it is no longer needed. Link: https://lkml.kernel.org/r/20230901151039.125186-2-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Brian Foster
|
3d07fa1dd1 |
tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY
The pipe cpumask used to serialize opens between the main and percpu
trace pipes is not zeroed or initialized. This can result in
spurious -EBUSY returns if underlying memory is not fully zeroed.
This has been observed by immediate failure to read the main
trace_pipe file on an otherwise newly booted and idle system:
# cat /sys/kernel/debug/tracing/trace_pipe
cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy
Zero the allocation of pipe_cpumask to avoid the problem.
Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com
Cc: stable@vger.kernel.org
Fixes:
|
||
Ruan Jinjie
|
2a30dbcbef |
ftrace: Use LIST_HEAD to initialize clear_hash
Use LIST_HEAD() to initialize clear_hash instead of open-coding it. Link: https://lore.kernel.org/linux-trace-kernel/20230809071551.913041-1-ruanjinjie@huawei.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Levi Yun
|
1351148904 |
ftrace: Use within_module to check rec->ip within specified module.
within_module_core && within_module_init condition is same to within module but it's more readable. Use within_module instead of former condition to check rec->ip within specified module area or not. Link: https://lore.kernel.org/linux-trace-kernel/20230803205236.32201-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Zheng Yejian
|
3163f635b2 |
tracing: Fix race issue between cpu buffer write and swap
Warning happened in rb_end_commit() at code:
if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing)))
WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142
rb_commit+0x402/0x4a0
Call Trace:
ring_buffer_unlock_commit+0x42/0x250
trace_buffer_unlock_commit_regs+0x3b/0x250
trace_event_buffer_commit+0xe5/0x440
trace_event_buffer_reserve+0x11c/0x150
trace_event_raw_event_sched_switch+0x23c/0x2c0
__traceiter_sched_switch+0x59/0x80
__schedule+0x72b/0x1580
schedule+0x92/0x120
worker_thread+0xa0/0x6f0
It is because the race between writing event into cpu buffer and swapping
cpu buffer through file per_cpu/cpu0/snapshot:
Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1
-------- --------
tracing_snapshot_write()
[...]
ring_buffer_lock_reserve()
cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a';
[...]
rb_reserve_next_event()
[...]
ring_buffer_swap_cpu()
if (local_read(&cpu_buffer_a->committing))
goto out_dec;
if (local_read(&cpu_buffer_b->committing))
goto out_dec;
buffer_a->buffers[cpu] = cpu_buffer_b;
buffer_b->buffers[cpu] = cpu_buffer_a;
// 2. cpu_buffer has swapped here.
rb_start_commit(cpu_buffer);
if (unlikely(READ_ONCE(cpu_buffer->buffer)
!= buffer)) { // 3. This check passed due to 'cpu_buffer->buffer'
[...] // has not changed here.
return NULL;
}
cpu_buffer_b->buffer = buffer_a;
cpu_buffer_a->buffer = buffer_b;
[...]
// 4. Reserve event from 'cpu_buffer_a'.
ring_buffer_unlock_commit()
[...]
cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!!
rb_commit(cpu_buffer)
rb_end_commit() // 6. WARN for the wrong 'committing' state !!!
Based on above analysis, we can easily reproduce by following testcase:
``` bash
#!/bin/bash
dmesg -n 7
sysctl -w kernel.panic_on_warn=1
TR=/sys/kernel/tracing
echo 7 > ${TR}/buffer_size_kb
echo "sched:sched_switch" > ${TR}/set_event
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
```
To fix it, IIUC, we can use smp_call_function_single() to do the swap on
the target cpu where the buffer is located, so that above race would be
avoided.
Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Fixes:
|
||
Mikhail Kobuk
|
2cf0dee989 |
tracing: Remove extra space at the end of hwlat_detector/mode
Space is printed after each mode value including the last one:
$ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
"none [round-robin] per-cpu "
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Linus Torvalds
|
34232fcfe9 |
Tracing updates for 6.6:
User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal. - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up.) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic. - Other minor clean ups. -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEXtmkj8VMCiLR0IBM68Js21pW3nMFAmTwtAsUHHJvc3RlZHRA Z29vZG1pcy5vcmcACgkQ68Js21pW3nNOXRAAsslQT6alY4OeplC4x47+V6+6NiIA oDtOmWAqf7TsH9bukzRFD36rUly42O20RJDx9z0Q3iRc3vGxEawId8z6P0HmBwRb VSl5BryWvL5Wc5w94xS8EeCuC1MRfhVDyfbtVFmWigzfvd/f+hp71ViMPHUvrRJX KhzzNSBc4ir5E1lzfwa7meYTXzDwrQlZbYfdf5aH94IWAkqDj85PUZDJ7UmLZhXG CIglSpNFXZ0j19Wo/U6KZlHR1XfunBKungCzJ5Dbznc9YLWZTQXOIZF4YPKfPIJL ulRG9chwXY0nQWhG3xM1UHZLsAMSWw5i13a4ZN4d8FCNOgv8ttcJnfDk7ZYUS0Oz RmY1dGcSRKAZTUTjm8ZBtmyiUCc9kZAIk0fyEfIHtoDYXmhnvni3wuTnbRSdXaSi q4YkxPaLfX8Fn3QloCqqddt8iONu7BnbpZOhUCl2AtBib52gnTTF7+rQ6/0D3rjo SSuvEHhnjJhzk+3jM2odxjmTAztNT+yu6FbKXZUKPt1Kj9YHv1J9cEQw9/Etw+GV 8jQBe979D8hFJmDOJOT/O/TdPqE9mQoMNBt6Y8QnE4nbJWM+i/MBrThFpUSQhRCr 0Ya/HgR2QyRH7RmZW5o2H9mNtN+V9c7RxZW8erYzRbUs0YofK2OpGi9SrPzxWCke w6j0VVZHaxdPguM= =/s+e -----END PGP SIGNATURE----- Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic - Other minor cleanups" * tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) tracefs: Remove kerneldoc from struct eventfs_file tracefs: Avoid changing i_mode to a temp value tracing/user_events: Optimize safe list traversals ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon() tracing: Remove unused function declarations tracing/filters: Document cpumask filtering tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Dynamically allocate filter_pred.regex test: ftrace: Fix kprobe test for eventfs eventfs: Move tracing/events to eventfs eventfs: Implement removal of meta data from eventfs eventfs: Implement functions to create files and dirs when accessed eventfs: Implement eventfs lookup, read, open functions eventfs: Implement eventfs file add functions ... |
||
Linus Torvalds
|
bd6c11bc43 |
Networking changes for 6.6.
Core ---- - Increase size limits for to-be-sent skb frag allocations. This allows tun, tap devices and packet sockets to better cope with large writes operations. - Store netdevs in an xarray, to simplify iterating over netdevs. - Refactor nexthop selection for multipath routes. - Improve sched class lifetime handling. - Add backup nexthop ID support for bridge. - Implement drop reasons support in openvswitch. - Several data races annotations and fixes. - Constify the sk parameter of routing functions. - Prepend kernel version to netconsole message. Protocols --------- - Implement support for TCP probing the peer being under memory pressure. - Remove hard coded limitation on IPv6 specific info placement inside the socket struct. - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per socket scaling factor. - Scaling-up the IPv6 expired route GC via a separated list of expiring routes. - In-kernel support for the TLS alert protocol. - Better support for UDP reuseport with connected sockets. - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR header size. - Get rid of additional ancillary per MPTCP connection struct socket. - Implement support for BPF-based MPTCP packet schedulers. - Format MPTCP subtests selftests results in TAP. - Several new SMC 2.1 features including unique experimental options, max connections per lgr negotiation, max links per lgr negotiation. BPF --- - Multi-buffer support in AF_XDP. - Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds. - Implement an fd-based tc BPF attach API (TCX) and BPF link support on top of it. - Add SO_REUSEPORT support for TC bpf_sk_assign. - Support new instructions from cpu v4 to simplify the generated code and feature completeness, for x86, arm64, riscv64. - Support defragmenting IPv(4|6) packets in BPF. - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling. - Introduce bpf map element count and enable it for all program types. - Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy. - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress. - Add uprobe support for the bpf_get_func_ip helper. - Check skb ownership against full socket. - Support for up to 12 arguments in BPF trampoline. - Extend link_info for kprobe_multi and perf_event links. Netfilter --------- - Speed-up process exit by aborting ruleset validation if a fatal signal is pending. - Allow NLA_POLICY_MASK to be used with BE16/BE32 types. Driver API ---------- - Page pool optimizations, to improve data locality and cache usage. - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need for raw ioctl() handling in drivers. - Simplify genetlink dump operations (doit/dumpit) providing them the common information already populated in struct genl_info. - Extend and use the yaml devlink specs to [re]generate the split ops. - Introduce devlink selective dumps, to allow SF filtering SF based on handle and other attributes. - Add yaml netlink spec for netlink-raw families, allow route, link and address related queries via the ynl tool. - Remove phylink legacy mode support. - Support offload LED blinking to phy. - Add devlink port function attributes for IPsec. New hardware / drivers ---------------------- - Ethernet: - Broadcom ASP 2.0 (72165) ethernet controller - MediaTek MT7988 SoC - Texas Instruments AM654 SoC - Texas Instruments IEP driver - Atheros qca8081 phy - Marvell 88Q2110 phy - NXP TJA1120 phy - WiFi: - MediaTek mt7981 support - Can: - Kvaser SmartFusion2 PCI Express devices - Allwinner T113 controllers - Texas Instruments tcan4552/4553 chips - Bluetooth: - Intel Gale Peak - Qualcomm WCN3988 and WCN7850 - NXP AW693 and IW624 - Mediatek MT2925 Drivers ------- - Ethernet NICs: - nVidia/Mellanox: - mlx5: - support UDP encapsulation in packet offload mode - IPsec packet offload support in eswitch mode - improve aRFS observability by adding new set of counters - extends MACsec offload support to cover RoCE traffic - dynamic completion EQs - mlx4: - convert to use auxiliary bus instead of custom interface logic - Intel - ice: - implement switchdev bridge offload, even for LAG interfaces - implement SRIOV support for LAG interfaces - igc: - add support for multiple in-flight TX timestamps - Broadcom: - bnxt: - use the unified RX page pool buffers for XDP and non-XDP - use the NAPI skb allocation cache - OcteonTX2: - support Round Robin scheduling HTB offload - TC flower offload support for SPI field - Freescale: - add XDP_TX feature support - AMD: - ionic: add support for PCI FLR event - sfc: - basic conntrack offload - introduce eth, ipv4 and ipv6 pedit offloads - ST Microelectronics: - stmmac: maximze PTP timestamping resolution - Virtual NICs: - Microsoft vNIC: - batch ringing RX queue doorbell on receiving packets - add page pool for RX buffers - Virtio vNIC: - add per queue interrupt coalescing support - Google vNIC: - add queue-page-list mode support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add port range matching tc-flower offload - permit enslavement to netdevices with uppers - Ethernet embedded switches: - Marvell (mv88e6xxx): - convert to phylink_pcs - Renesas: - r8A779fx: add speed change support - rzn1: enables vlan support - Ethernet PHYs: - convert mv88e6xxx to phylink_pcs - WiFi: - Qualcomm Wi-Fi 7 (ath12k): - extremely High Throughput (EHT) PHY support - RealTek (rtl8xxxu): - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU), RTL8192EU and RTL8723BU - RealTek (rtw89): - Introduce Time Averaged SAR (TAS) support - Connector: - support for event filtering Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq 9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO /QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz 6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1 /daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3 Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n 4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP DCW8VGFOZjZM =9CsR -----END PGP SIGNATURE----- Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Increase size limits for to-be-sent skb frag allocations. This allows tun, tap devices and packet sockets to better cope with large writes operations - Store netdevs in an xarray, to simplify iterating over netdevs - Refactor nexthop selection for multipath routes - Improve sched class lifetime handling - Add backup nexthop ID support for bridge - Implement drop reasons support in openvswitch - Several data races annotations and fixes - Constify the sk parameter of routing functions - Prepend kernel version to netconsole message Protocols: - Implement support for TCP probing the peer being under memory pressure - Remove hard coded limitation on IPv6 specific info placement inside the socket struct - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per socket scaling factor - Scaling-up the IPv6 expired route GC via a separated list of expiring routes - In-kernel support for the TLS alert protocol - Better support for UDP reuseport with connected sockets - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR header size - Get rid of additional ancillary per MPTCP connection struct socket - Implement support for BPF-based MPTCP packet schedulers - Format MPTCP subtests selftests results in TAP - Several new SMC 2.1 features including unique experimental options, max connections per lgr negotiation, max links per lgr negotiation BPF: - Multi-buffer support in AF_XDP - Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds - Implement an fd-based tc BPF attach API (TCX) and BPF link support on top of it - Add SO_REUSEPORT support for TC bpf_sk_assign - Support new instructions from cpu v4 to simplify the generated code and feature completeness, for x86, arm64, riscv64 - Support defragmenting IPv(4|6) packets in BPF - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling - Introduce bpf map element count and enable it for all program types - Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress - Add uprobe support for the bpf_get_func_ip helper - Check skb ownership against full socket - Support for up to 12 arguments in BPF trampoline - Extend link_info for kprobe_multi and perf_event links Netfilter: - Speed-up process exit by aborting ruleset validation if a fatal signal is pending - Allow NLA_POLICY_MASK to be used with BE16/BE32 types Driver API: - Page pool optimizations, to improve data locality and cache usage - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need for raw ioctl() handling in drivers - Simplify genetlink dump operations (doit/dumpit) providing them the common information already populated in struct genl_info - Extend and use the yaml devlink specs to [re]generate the split ops - Introduce devlink selective dumps, to allow SF filtering SF based on handle and other attributes - Add yaml netlink spec for netlink-raw families, allow route, link and address related queries via the ynl tool - Remove phylink legacy mode support - Support offload LED blinking to phy - Add devlink port function attributes for IPsec New hardware / drivers: - Ethernet: - Broadcom ASP 2.0 (72165) ethernet controller - MediaTek MT7988 SoC - Texas Instruments AM654 SoC - Texas Instruments IEP driver - Atheros qca8081 phy - Marvell 88Q2110 phy - NXP TJA1120 phy - WiFi: - MediaTek mt7981 support - Can: - Kvaser SmartFusion2 PCI Express devices - Allwinner T113 controllers - Texas Instruments tcan4552/4553 chips - Bluetooth: - Intel Gale Peak - Qualcomm WCN3988 and WCN7850 - NXP AW693 and IW624 - Mediatek MT2925 Drivers: - Ethernet NICs: - nVidia/Mellanox: - mlx5: - support UDP encapsulation in packet offload mode - IPsec packet offload support in eswitch mode - improve aRFS observability by adding new set of counters - extends MACsec offload support to cover RoCE traffic - dynamic completion EQs - mlx4: - convert to use auxiliary bus instead of custom interface logic - Intel - ice: - implement switchdev bridge offload, even for LAG interfaces - implement SRIOV support for LAG interfaces - igc: - add support for multiple in-flight TX timestamps - Broadcom: - bnxt: - use the unified RX page pool buffers for XDP and non-XDP - use the NAPI skb allocation cache - OcteonTX2: - support Round Robin scheduling HTB offload - TC flower offload support for SPI field - Freescale: - add XDP_TX feature support - AMD: - ionic: add support for PCI FLR event - sfc: - basic conntrack offload - introduce eth, ipv4 and ipv6 pedit offloads - ST Microelectronics: - stmmac: maximze PTP timestamping resolution - Virtual NICs: - Microsoft vNIC: - batch ringing RX queue doorbell on receiving packets - add page pool for RX buffers - Virtio vNIC: - add per queue interrupt coalescing support - Google vNIC: - add queue-page-list mode support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add port range matching tc-flower offload - permit enslavement to netdevices with uppers - Ethernet embedded switches: - Marvell (mv88e6xxx): - convert to phylink_pcs - Renesas: - r8A779fx: add speed change support - rzn1: enables vlan support - Ethernet PHYs: - convert mv88e6xxx to phylink_pcs - WiFi: - Qualcomm Wi-Fi 7 (ath12k): - extremely High Throughput (EHT) PHY support - RealTek (rtl8xxxu): - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU), RTL8192EU and RTL8723BU - RealTek (rtw89): - Introduce Time Averaged SAR (TAS) support - Connector: - support for event filtering" * tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits) net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler net: stmmac: clarify difference between "interface" and "phy_interface" r8152: add vendor/device ID pair for D-Link DUB-E250 devlink: move devlink_notify_register/unregister() to dev.c devlink: move small_ops definition into netlink.c devlink: move tracepoint definitions into core.c devlink: push linecard related code into separate file devlink: push rate related code into separate file devlink: push trap related code into separate file devlink: use tracepoint_enabled() helper devlink: push region related code into separate file devlink: push param related code into separate file devlink: push resource related code into separate file devlink: push dpipe related code into separate file devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper devlink: push shared buffer related code into separate file devlink: push port related code into separate file devlink: push object register/unregister notifications into separate helpers inet: fix IP_TRANSPARENT error handling ... |
||
Masami Hiramatsu (Google)
|
08c9306fc2 |
tracing/fprobe-event: Assume fprobe is a return event by $retval
Assume the fprobe event is a return event if there is $retval is used in the probe's argument without %return. e.g. echo 'f:myevent vfs_read $retval' >> dynamic_events then 'myevent' is a return probe event. Link: https://lore.kernel.org/all/169272160261.160970.13613040161560998787.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
27973e5c64 |
tracing/probes: Add string type check with BTF
Add a string type checking with BTF information if possible. This will check whether the given BTF argument (and field) is signed char array or pointer to signed char. If not, it reject the 'string' type. If it is pointer to signed char, it adds a dereference opration so that it can correctly fetch the string data from memory. # echo 'f getname_flags%return retval->name:string' >> dynamic_events # echo 't sched_switch next->comm:string' >> dynamic_events The above cases, 'struct filename::name' is 'char *' and 'struct task_struct::comm' is 'char []'. But in both case, user can specify ':string' to fetch the string data. Link: https://lore.kernel.org/all/169272159250.160970.1881112937198526188.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
d157d76944 |
tracing/probes: Support BTF field access from $retval
Support BTF argument on '$retval' for function return events including kretprobe and fprobe for accessing the return value. This also allows user to access its fields if the return value is a pointer of a data structure. E.g. # echo 'f getname_flags%return +0($retval->name):string' \ > dynamic_events # echo 1 > events/fprobes/getname_flags__exit/enable # ls > /dev/null # head -n 40 trace | tail ls-87 [000] ...1. 8067.616101: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./function_profile_enabled" ls-87 [000] ...1. 8067.616108: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./trace_stat" ls-87 [000] ...1. 8067.616115: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_notrace" ls-87 [000] ...1. 8067.616122: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_function" ls-87 [000] ...1. 8067.616129: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace" ls-87 [000] ...1. 8067.616135: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_filter" ls-87 [000] ...1. 8067.616143: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./touched_functions" ls-87 [000] ...1. 8067.616237: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./enabled_functions" ls-87 [000] ...1. 8067.616245: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./available_filter_functions" ls-87 [000] ...1. 8067.616253: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace_pid" Link: https://lore.kernel.org/all/169272158234.160970.2446691104240645205.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
c440adfbe3 |
tracing/probes: Support BTF based data structure field access
Using BTF to access the fields of a data structure. You can use this for accessing the field with '->' or '.' operation with BTF argument. # echo 't sched_switch next=next->pid vruntime=next->se.vruntime' \ > dynamic_events # echo 1 > events/tracepoints/sched_switch/enable # head -n 40 trace | tail <idle>-0 [000] d..3. 272.565382: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956533179 kcompactd0-26 [000] d..3. 272.565406: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0 <idle>-0 [000] d..3. 273.069441: sched_switch: (__probestub_sched_switch+0x4/0x10) next=9 vruntime=956533179 kworker/0:1-9 [000] d..3. 273.069464: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956579181 kcompactd0-26 [000] d..3. 273.069480: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0 <idle>-0 [000] d..3. 273.141434: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956533179 kworker/u2:1-22 [000] d..3. 273.141461: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0 <idle>-0 [000] d..3. 273.480872: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956585857 kworker/u2:1-22 [000] d..3. 273.480905: sched_switch: (__probestub_sched_switch+0x4/0x10) next=70 vruntime=959533179 sh-70 [000] d..3. 273.481102: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0 Link: https://lore.kernel.org/all/169272157251.160970.9318175874130965571.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
302db0f5b3 |
tracing/probes: Add a function to search a member of a struct/union
Add btf_find_struct_member() API to search a member of a given data structure or union from the member's name. Link: https://lore.kernel.org/all/169272156248.160970.8868479822371129043.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
ebeed8d4a5 |
tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
Move generic function-proto find API and getting function parameter API to BTF library code from trace_probe.c. This will avoid redundant efforts on different feature. Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Masami Hiramatsu (Google)
|
b1d1e90490 |
tracing/probes: Support BTF argument on module functions
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in the vmlinux, BTF argument is not available on the functions in the modules. Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind() so that BTF argument can find the correct struct btf and btf_type in it. With this fix, fprobe events can use `$arg*` on module functions as below # grep nf_log_ip_packet /proc/kallsyms ffffffffa0005c00 t nf_log_ip_packet [nf_log_syslog] ffffffffa0005bf0 t __pfx_nf_log_ip_packet [nf_log_syslog] # echo 'f nf_log_ip_packet $arg*' > dynamic_events # cat dynamic_events f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix To support the module's btf which is removable, the struct btf needs to be ref-counted. So this also records the btf in the traceprobe_parse_context and returns the refcount when the parse has done. Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Chuang Wang
|
f8bbf8b990 |
tracing/eprobe: Iterate trace_eprobe directly
Refer to the description in [1], we can skip "container_of()" following "list_for_each_entry()" by using "list_for_each_entry()" with "struct trace_eprobe" and "tp.list". Also, this patch defines "for_each_trace_eprobe_tp" to simplify the code of the same logic. [1] https://lore.kernel.org/all/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com/ Link: https://lore.kernel.org/all/20230822022433.262478-1-nashuiliang@gmail.com/ Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
Eric Vaughn
|
a943188dab |
tracing/user_events: Optimize safe list traversals
Several of the list traversals in the user_events facility use safe list traversals where they could be using the unsafe versions instead. Replace these safe traversals with their unsafe counterparts in the interest of optimization. Link: https://lore.kernel.org/linux-trace-kernel/20230810194337.695983-1-ervaughn@linux.microsoft.com Suggested-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Eric Vaughn <ervaughn@linux.microsoft.com> Acked-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Yue Haibing
|
efde97a175 |
tracing: Remove unused function declarations
Commit |
||
Valentin Schneider
|
38c6f68083 |
tracing/filters: Further optimise scalar vs cpumask comparison
Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks. Link: https://lkml.kernel.org/r/20230707172155.70873-9-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
1cffbe6c62 |
tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. In this case we can directly re-use filter_pred_cpu(), we just need to transform '&' into '==' before executing it. Link: https://lkml.kernel.org/r/20230707172155.70873-8-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
ca77dd8ce4 |
tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand. Link: https://lkml.kernel.org/r/20230707172155.70873-7-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
fe4fa4ec9b |
tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. Reuse do_filter_scalar_cpumask() when the input mask has a weight of one. Link: https://lkml.kernel.org/r/20230707172155.70873-6-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
347d24fc82 |
tracing/filters: Enable filtering the CPU common field by a cpumask
The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event). A previous commit added filtering scalar fields by a user-given cpumask, make this work with the CPU common field as well. This enables doing things like $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \ -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}' Link: https://lkml.kernel.org/r/20230707172155.70873-5-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
3cbec9d7b9 |
tracing/filters: Enable filtering a scalar field by a cpumask
Several events use a scalar field to denote a CPU: o sched_wakeup.target_cpu o sched_migrate_task.orig_cpu,dest_cpu o sched_move_numa.src_cpu,dst_cpu o ipi_send_cpu.cpu o ... Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids. Allow these to be filtered by a user-provided cpumask, which enables e.g.: $ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-4-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
39f7c41c90 |
tracing/filters: Enable filtering a cpumask field by another cpumask
The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions. Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask. The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression. This enables e.g.: $ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-3-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Valentin Schneider
|
cfb58e278c |
tracing/filters: Dynamically allocate filter_pred.regex
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field. A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated. While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline. No change in behaviour intended. The kfree()'s were patched via Coccinelle: @@ struct filter_pred *pred; @@ -kfree(pred); +free_predicate(pred); Link: https://lkml.kernel.org/r/20230707172155.70873-2-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Jiri Olsa
|
686328d80c |
bpf: Add bpf_get_func_ip helper support for uprobe link
Adding support for bpf_get_func_ip helper being called from ebpf program attached by uprobe_multi link. It returns the ip of the uprobe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jiri Olsa
|
b733eeade4 |
bpf: Add pid filter support for uprobe_multi link
Adding support to specify pid for uprobe_multi link and the uprobes are created only for task with given pid value. Using the consumer.filter filter callback for that, so the task gets filtered during the uprobe installation. We still need to check the task during runtime in the uprobe handler, because the handler could get executed if there's another system wide consumer on the same uprobe (thanks Oleg for the insight). Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jiri Olsa
|
0b779b61f6 |
bpf: Add cookies support for uprobe_multi link
Adding support to specify cookies array for uprobe_multi link. The cookies array share indexes and length with other uprobe_multi arrays (offsets/ref_ctr_offsets). The cookies[i] value defines cookie for i-the uprobe and will be returned by bpf_get_attach_cookie helper when called from ebpf program hooked to that specific uprobe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jiri Olsa
|
89ae89f53d |
bpf: Add multi uprobe link
Adding new multi uprobe link that allows to attach bpf program to multiple uprobes. Uprobes to attach are specified via new link_create uprobe_multi union: struct { __aligned_u64 path; __aligned_u64 offsets; __aligned_u64 ref_ctr_offsets; __u32 cnt; __u32 flags; } uprobe_multi; Uprobes are defined for single binary specified in path and multiple calling sites specified in offsets array with optional reference counters specified in ref_ctr_offsets array. All specified arrays have length of 'cnt'. The 'flags' supports single bit for now that marks the uprobe as return probe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Zheng Yejian
|
c2489bb7e6 |
tracing: Introduce pipe_cpumask to avoid race on trace_pipes
There is race issue when concurrently splice_read main trace_pipe and per_cpu trace_pipes which will result in data read out being different from what actually writen. As suggested by Steven: > I believe we should add a ref count to trace_pipe and the per_cpu > trace_pipes, where if they are opened, nothing else can read it. > > Opening trace_pipe locks all per_cpu ref counts, if any of them are > open, then the trace_pipe open will fail (and releases any ref counts > it had taken). > > Opening a per_cpu trace_pipe will up the ref count for just that > CPU buffer. This will allow multiple tasks to read different per_cpu > trace_pipe files, but will prevent the main trace_pipe file from > being opened. But because we only need to know whether per_cpu trace_pipe is open or not, using a cpumask instead of using ref count may be easier. After this patch, users will find that: - Main trace_pipe can be opened by only one user, and if it is opened, all per_cpu trace_pipes cannot be opened; - Per_cpu trace_pipes can be opened by multiple users, but each per_cpu trace_pipe can only be opened by one user. And if one of them is opened, main trace_pipe cannot be opened. Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
Zheng Yejian
|
eecb91b9f9 |
tracing: Fix memleak due to race between current_tracer and trace
Kmemleak report a leak in graph_trace_open():
unreferenced object 0xffff0040b95f4a00 (size 128):
comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
hex dump (first 32 bytes):
e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
backtrace:
[<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
[<000000007df90faa>] graph_trace_open+0xb0/0x344
[<00000000737524cd>] __tracing_open+0x450/0xb10
[<0000000098043327>] tracing_open+0x1a0/0x2a0
[<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
[<000000004015bcd6>] vfs_open+0x98/0xd0
[<000000002b5f60c9>] do_open+0x520/0x8d0
[<00000000376c7820>] path_openat+0x1c0/0x3e0
[<00000000336a54b5>] do_filp_open+0x14c/0x324
[<000000002802df13>] do_sys_openat2+0x2c4/0x530
[<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
[<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
[<00000000313647bf>] do_el0_svc+0xac/0xec
[<000000002ef1c651>] el0_svc+0x20/0x30
[<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
[<000000000c309c35>] el0_sync+0x160/0x180
The root cause is descripted as follows:
__tracing_open() { // 1. File 'trace' is being opened;
...
*iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is
// currently set;
...
iter->trace->open(iter); // 3. Call graph_trace_open() here,
// and memory are allocated in it;
...
}
s_start() { // 4. The opened file is being read;
...
*iter->trace = *tr->current_trace; // 5. If tracer is switched to
// 'nop' or others, then memory
// in step 3 are leaked!!!
...
}
To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.
Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com
Fixes:
|
||
Sven Schnelle
|
c4d6b54381 |
tracing/synthetic: Allocate one additional element for size
While debugging another issue I noticed that the stack trace contains one
invalid entry at the end:
<idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK:
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x150/0x2c0
=> kcompactd+0x9ca/0xc20
=> kthread+0x2f6/0x3d8
=> __ret_from_fork+0x8a/0xe8
=> 0x6b6b6b6b6b6b6b6b
This is because the code failed to add the one element containing the
number of entries to field_size.
Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Sven Schnelle
|
887f92e09e |
tracing/synthetic: Skip first entry for stack traces
While debugging another issue I noticed that the stack trace output
contains the number of entries on top:
<idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK:
=> 0x10
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x242/0x2c0
=> __wait_for_common+0x434/0x680
=> __wait_rcu_gp+0x198/0x3e0
=> synchronize_rcu+0x112/0x138
=> ring_buffer_reset_online_cpus+0x140/0x2e0
=> tracing_reset_online_cpus+0x15c/0x1d0
=> tracing_set_clock+0x180/0x1d8
=> hist_register_trigger+0x486/0x670
=> event_hist_trigger_parse+0x494/0x1318
=> trigger_process_regex+0x1d4/0x258
=> event_trigger_write+0xb4/0x170
=> vfs_write+0x210/0xad0
=> ksys_write+0x122/0x208
Fix this by skipping the first element. Also replace the pointer
logic with an index variable which is easier to read.
Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|