Commit Graph

39760 Commits

Author SHA1 Message Date
Namhyung Kim
16817ad7e8 perf/bpf: Always use perf callchains if exist
If the perf_event has PERF_SAMPLE_CALLCHAIN, BPF can use it for stack trace.
The problematic cases like PEBS and IBS already handled in the PMU driver and
they filled the callchain info in the sample data.  For others, we can call
perf_callchain() before the BPF handler.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220908214104.3851807-2-namhyung@kernel.org
2022-09-13 15:03:22 +02:00
Namhyung Kim
3749d33e51 perf: Use sample_flags for callchain
So that it can call perf_callchain() only if needed.  Historically it used
__PERF_SAMPLE_CALLCHAIN_EARLY but we can do that with sample_flags in the
struct perf_sample_data.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220908214104.3851807-1-namhyung@kernel.org
2022-09-13 15:03:22 +02:00
Peter Zijlstra
f3c0eba287 perf: Add a few assertions
While auditing 6b959ba22d ("perf/core: Fix reentry problem in
perf_output_read_group()") a few spots were found that wanted
assertions.

Notable for_each_sibling_event() relies on exclusion from
modification. This would normally be holding either ctx->lock or
ctx->mutex, however due to how things are constructed disabling IRQs
is a valid and sufficient substitute for ctx->lock.

Another possible site to add assertions would be the various
pmu::{add,del,read,..}() methods, but that's not trivially expressable
in C -- the best option is wrappers, but those are easy enough to
forget.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-09-07 21:54:01 +02:00
Anshuman Khandual
03b02db93b perf: Consolidate branch sample filter helpers
Besides the branch type filtering requests, 'event.attr.branch_sample_type'
also contains various flags indicating which additional information should
be captured, along with the base branch record. These flags help configure
the underlying hardware, and capture the branch records appropriately when
required e.g after PMU interrupt. But first, this moves an existing helper
perf_sample_save_hw_index() into the header before adding some more helpers
for other branch sample filter flags.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220906084414.396220-1-anshuman.khandual@arm.com
2022-09-07 21:54:00 +02:00
Kan Liang
ee9db0e14b perf: Use sample_flags for txn
Use the new sample_flags to indicate whether the txn field is filled by
the PMU driver.

Remove the txn field from the perf_sample_data_init() to minimize the
number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
2022-09-06 11:33:03 +02:00
Kan Liang
e16fd7f2cb perf: Use sample_flags for data_src
Use the new sample_flags to indicate whether the data_src field is
filled by the PMU driver.

Remove the data_src field from the perf_sample_data_init() to minimize
the number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
2022-09-06 11:33:03 +02:00
Kan Liang
2abe681da0 perf: Use sample_flags for weight
Use the new sample_flags to indicate whether the weight field is filled
by the PMU driver.

Remove the weight field from the perf_sample_data_init() to minimize the
number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
2022-09-06 11:33:02 +02:00
Kan Liang
a9a931e266 perf: Use sample_flags for branch stack
Use the new sample_flags to indicate whether the branch stack is filled
by the PMU driver.

Remove the br_stack from the perf_sample_data_init() to minimize the number
of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
2022-09-06 11:33:02 +02:00
Kan Liang
3aac580d5c perf: Add sample_flags to indicate the PMU-filled sample data
On some platforms, some data e.g., timestamps, can be retrieved from
the PMU driver. Usually, the data from the PMU driver is more accurate.
The current perf kernel should output the PMU-filled sample data if
it's available.

To check the availability of the PMU-filled sample data, the current
perf kernel initializes the related fields in the
perf_sample_data_init(). When outputting a sample, the perf checks
whether the field is updated by the PMU driver. If yes, the updated
value will be output. If not, the perf uses an SW way to calculate the
value or just outputs the initialized value if an SW way is unavailable
either.

With more and more data being provided by the PMU driver, more fields
has to be initialized in the perf_sample_data_init(). That will
increase the number of cache lines touched in perf_sample_data_init()
and be harmful to the performance.

Add new "sample_flags" to indicate the PMU-filled sample data. The PMU
driver should set the corresponding PERF_SAMPLE_ flag when the field is
updated. The initialization of the corresponding field is not required
anymore. The following patches will make use of it and remove the
corresponding fields from the perf_sample_data_init(), which will
further minimize the number of cache lines touched.

Only clear the sample flags that have already been done by the PMU
driver in the perf_prepare_sample() for the PERF_RECORD_SAMPLE. For the
other PERF_RECORD_ event type, the sample data is not available.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-2-kan.liang@linux.intel.com
2022-09-06 11:33:01 +02:00
Marco Elver
ecdfb8896f perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets
We can still see that a majority of the time is spent hashing task pointers:

    ...
    16.98%  [kernel]       [k] rhashtable_jhash2
    ...

Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
CPU-independent. The reason for this is to update the per-CPU
'tsk_pinned' histogram.

To optimize the CPU-independent case to O(1), keep a separate
CPU-independent 'tsk_pinned_all' histogram.

The major source of complexity are transitions between "all
CPU-independent task breakpoints" and "mixed CPU-independent and
CPU-dependent task breakpoints". The code comments list all cases that
require handling.

After this optimization:

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.758 [sec]
 |
 |       34.336621 usecs/op
 |     4395.087500 usecs/op/cpu

    38.08%  [kernel]       [k] queued_spin_lock_slowpath
    10.81%  [kernel]       [k] smp_cfm_core_cond
     3.01%  [kernel]       [k] update_sg_lb_stats
     2.58%  [kernel]       [k] osq_lock
     2.57%  [kernel]       [k] llist_reverse_order
     1.45%  [kernel]       [k] find_next_bit
     1.21%  [kernel]       [k] flush_tlb_func_common
     1.01%  [kernel]       [k] arch_install_hw_breakpoint

Showing that the time spent hashing keys has become insignificant.

With the given benchmark parameters, that's an improvement of 12%
compared with the old O(#cpus) version.

And finally, using the less aggressive parameters from the preceding
changes, we now observe:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.292187 usecs/op
 |     2258.700000 usecs/op/cpu

Which is an improvement of 12% compared to without the histogram
optimizations (baseline is 40 usecs/op). This is now on par with the
theoretical ideal (constraints disabled), and only 12% slower than no
breakpoints at all.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com
2022-08-30 10:56:24 +02:00
Marco Elver
9b1933b864 perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets
Running the perf benchmark with (note: more aggressive parameters vs.
preceding changes, but same 256 CPUs host):

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.989 [sec]
 |
 |       38.854160 usecs/op
 |     4973.332500 usecs/op/cpu

    20.43%  [kernel]       [k] queued_spin_lock_slowpath
    18.75%  [kernel]       [k] osq_lock
    16.98%  [kernel]       [k] rhashtable_jhash2
     8.34%  [kernel]       [k] task_bp_pinned
     4.23%  [kernel]       [k] smp_cfm_core_cond
     3.65%  [kernel]       [k] bcmp
     2.83%  [kernel]       [k] toggle_bp_slot
     1.87%  [kernel]       [k] find_next_bit
     1.49%  [kernel]       [k] __reserve_bp_slot

We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().

Obtaining the max_bp_pinned_slots() for CPU-independent task targets
currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if
the result of task_bp_pinned() is CPU-independent.

The loop in max_bp_pinned_slots() wants to compute the maximum slots
across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
obtaining the max slots across all CPUs and adding task_bp_pinned().

To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.

After this optimization:

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.930 [sec]
 |
 |       37.697832 usecs/op
 |     4825.322500 usecs/op/cpu

    19.13%  [kernel]       [k] queued_spin_lock_slowpath
    18.21%  [kernel]       [k] rhashtable_jhash2
    15.46%  [kernel]       [k] osq_lock
     6.27%  [kernel]       [k] toggle_bp_slot
     5.91%  [kernel]       [k] task_bp_pinned
     5.05%  [kernel]       [k] smp_cfm_core_cond
     1.78%  [kernel]       [k] update_sg_lb_stats
     1.36%  [kernel]       [k] llist_reverse_order
     1.34%  [kernel]       [k] find_next_bit
     1.19%  [kernel]       [k] bcmp

Suggesting that time spent in task_bp_pinned() has been reduced.
However, we're still hashing too much, which will be addressed in the
subsequent change.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
2022-08-30 10:56:24 +02:00
Marco Elver
16db2839a5 perf/hw_breakpoint: Introduce bp_slots_histogram
Factor out the existing `atomic_t count[N]` into its own struct called
'bp_slots_histogram', to generalize and make its intent clearer in
preparation of reusing elsewhere. The basic idea of bucketing "total
uses of N slots" resembles a histogram, so calling it such seems most
intuitive.

No functional change.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com
2022-08-30 10:56:24 +02:00
Marco Elver
0912037fec perf/hw_breakpoint: Reduce contention with large number of tasks
While optimizing task_bp_pinned()'s runtime complexity to O(1) on
average helps reduce time spent in the critical section, we still suffer
due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
that now contention is the biggest issue:

    95.93%  [kernel]       [k] osq_lock
     0.70%  [kernel]       [k] mutex_spin_on_owner
     0.22%  [kernel]       [k] smp_cfm_core_cond
     0.18%  [kernel]       [k] task_bp_pinned
     0.18%  [kernel]       [k] rhashtable_jhash2
     0.15%  [kernel]       [k] queued_spin_lock_slowpath

when running the breakpoint benchmark with (system with 256 CPUs):

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.207 [sec]
 |
 |      108.267188 usecs/op
 |     6929.100000 usecs/op/cpu

The main concern for synchronizing the breakpoint constraints data is
that a consistent snapshot of the per-CPU and per-task data is observed.

The access pattern is as follows:

 1. If the target is a task: the task's pinned breakpoints are counted,
    checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
    is used to check for conflicts with CPU-only breakpoints;
    bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
    unused.

 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
    with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
    incremented. No per-task breakpoints are checked.

Since rhltable safely synchronizes insertions/deletions, we can allow
concurrency as follows:

 1. If the target is a task: independent tasks may update and check the
    constraints concurrently, but same-task target calls need to be
    serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
    checked, these modifications can happen concurrently by switching
    tsk_pinned to atomic_t.

 2. If the target is a CPU: access to the per-CPU constraints needs to
    be serialized with other CPU-target and task-target callers (to
    stabilize the bp_cpuinfo::tsk_pinned snapshot).

We can allow the above concurrency by introducing a per-CPU constraints
data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses
task_struct::perf_event_mutex):

  1. If the target is a task: acquires perf_event_mutex, and acquires
     bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
     contention in the presence of many read-lock but few write-lock
     acquisitions: we assume many orders of magnitude more task target
     breakpoints creations/destructions than CPU target breakpoints.

  2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.

With these changes, contention with thousands of tasks is reduced to the
point where waiting on locking no longer dominates the profile:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.077 [sec]
 |
 |       40.201563 usecs/op
 |     2572.900000 usecs/op/cpu

    21.54%  [kernel]       [k] task_bp_pinned
    20.18%  [kernel]       [k] rhashtable_jhash2
     6.81%  [kernel]       [k] toggle_bp_slot
     5.47%  [kernel]       [k] queued_spin_lock_slowpath
     3.75%  [kernel]       [k] smp_cfm_core_cond
     3.48%  [kernel]       [k] bcmp

On this particular setup that's a speedup of 2.7x.

We're also getting closer to the theoretical ideal performance through
optimizations in hw_breakpoint.c -- constraints accounting disabled:

 | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.286458 usecs/op
 |     2258.333333 usecs/op/cpu

Which means the current implementation is ~12% slower than the
theoretical ideal.

For reference, performance without any breakpoints:

 | $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
 |      Total time: 0.060 [sec]
 |
 |       31.365625 usecs/op
 |     2007.400000 usecs/op/cpu

On a system with 256 CPUs, the theoretical ideal is only ~12% slower
than no breakpoints at all; the current implementation is ~28% slower.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
2022-08-30 10:56:24 +02:00
Marco Elver
01fe8a3f81 locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked()
Implement simple accessors to probe percpu-rwsem's locked state:
percpu_is_write_locked(), percpu_is_read_locked().

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-11-elver@google.com
2022-08-30 10:56:23 +02:00
Marco Elver
24198ad373 perf/hw_breakpoint: Remove useless code related to flexible breakpoints
Flexible breakpoints have never been implemented, with
bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max
flexible count in fetch_bp_busy_slots().

This again causes suboptimal code generation, when we always know that
`!!slots.flexible` will be 0.

Just get rid of the flexible "placeholder" and remove all real code
related to it. Make a note in the comment related to the constraints
algorithm but don't remove them from the algorithm, so that if in future
flexible breakpoints need supporting, it should be trivial to revive
them (along with reverting this change).

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
2022-08-30 10:56:22 +02:00
Marco Elver
9caf87be11 perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
Due to being a __weak function, hw_breakpoint_weight() will cause the
compiler to always emit a call to it. This generates unnecessarily bad
code (register spills etc.) for no good reason; in fact it appears in
profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:

    ...
    0.70%  [kernel]       [k] hw_breakpoint_weight
    ...

While a small percentage, no architecture defines its own
hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
which makes the fact it is currently __weak a poor choice.

Change hw_breakpoint_weight()'s definition to follow a similar protocol
to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
hw_breakpoint_weight(), we'll use it instead.

The result is that it is inlined and no longer shows up in profiles.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
2022-08-30 10:56:22 +02:00
Marco Elver
be3f152568 perf/hw_breakpoint: Optimize constant number of breakpoint slots
Optimize internal hw_breakpoint state if the architecture's number of
breakpoint slots is constant. This avoids several kmalloc() calls and
potentially unnecessary failures if the allocations fail, as well as
subtly improves code generation and cache locality.

The protocol is that if an architecture defines hw_breakpoint_slots via
the preprocessor, it must be constant and the same for all types.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
2022-08-30 10:56:22 +02:00
Marco Elver
db5f6f8531 perf/hw_breakpoint: Mark data __ro_after_init
Mark read-only data after initialization as __ro_after_init.

While we are here, turn 'constraints_initialized' into a bool.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
2022-08-30 10:56:21 +02:00
Marco Elver
0370dc314d perf/hw_breakpoint: Optimize list of per-task breakpoints
On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections which then results in contention on that
mutex as well:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.

While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we'll reach the
theoretical ideal performance i.e. no constraints).

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
2022-08-30 10:56:21 +02:00
Marco Elver
089cdcb0cd perf/hw_breakpoint: Clean up headers
Clean up headers:

 - Remove unused <linux/kallsyms.h>

 - Remove unused <linux/kprobes.h>

 - Remove unused <linux/module.h>

 - Remove unused <linux/smp.h>

 - Add <linux/export.h> for EXPORT_SYMBOL_GPL().

 - Add <linux/mutex.h> for mutex.

 - Sort alphabetically.

 - Move <linux/hw_breakpoint.h> to top to test it compiles on its own.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com
2022-08-30 10:56:21 +02:00
Marco Elver
c5b81449f9 perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
Provide hw_breakpoint_is_used() to check if breakpoints are in use on
the system.

Use it in the KUnit test to verify the global state before and after a
test case.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-3-elver@google.com
2022-08-30 10:56:20 +02:00
Marco Elver
724c299c6a perf/hw_breakpoint: Add KUnit test for constraints accounting
Add KUnit test for hw_breakpoint constraints accounting, with various
interesting mixes of breakpoint targets (some care was taken to catch
interesting corner cases via bug-injection).

The test cannot be built as a module because it requires access to
hw_breakpoint_slots(), which is not inlinable or exported on all
architectures.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com
2022-08-30 10:56:20 +02:00
Linus Torvalds
7fb312d225 Various fixes for tracing:
- Fix a return value of traceprobe_parse_event_name()
 
  - Fix NULL pointer dereference from failed ftrace enabling
 
  - Fix NULL pointer dereference when asking for registers from eprobes
 
  - Make eprobes consistent with kprobes/uprobes, filters and histograms
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwKRrhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qosDAP9WySmPxjoMfR0hbjmnepLy2zJtBbIq
 ABWR3LDrjvLlYwD9H/wrD+6ctOZtXh5XJc0Vn5z6XEyNtqrVGSse7Lm+sg4=
 =qb/R
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes for tracing:

   - Fix a return value of traceprobe_parse_event_name()

   - Fix NULL pointer dereference from failed ftrace enabling

   - Fix NULL pointer dereference when asking for registers from eprobes

   - Make eprobes consistent with kprobes/uprobes, filters and
     histograms"

* tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Have filter accept "common_cpu" to be consistent
  tracing/probes: Have kprobes and uprobes use $COMM too
  tracing/eprobes: Have event probes be consistent with kprobes and uprobes
  tracing/eprobes: Fix reading of string fields
  tracing/eprobes: Do not hardcode $comm as a string
  tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
  ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
  tracing/perf: Fix double put of trace event when init fails
  tracing: React to error return from traceprobe_parse_event_name()
2022-08-21 14:49:42 -07:00
Steven Rostedt (Google)
b2380577d4 tracing: Have filter accept "common_cpu" to be consistent
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.

Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 1e3bac71c5 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
ab8384442e tracing/probes: Have kprobes and uprobes use $COMM too
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.

Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
6a832ec3d6 tracing/eprobes: Have event probes be consistent with kprobes and uprobes
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.

Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.

For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.

Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
f04dec9346 tracing/eprobes: Fix reading of string fields
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.

Now when a string field is used, the ":string" type can be used to display
it:

  echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events

Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
02333de90e tracing/eprobes: Do not hardcode $comm as a string
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.

Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.

Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
2673c60ee6 tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
While playing with event probes (eprobes), I tried to see what would
happen if I attempted to retrieve the instruction pointer (%rip) knowing
that event probes do not use pt_regs. The result was:

 BUG: kernel NULL pointer dereference, address: 0000000000000024
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:get_event_field.isra.0+0x0/0x50
 Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8
50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24
8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74
 RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086
 RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000
 RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8
 R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff916c9ea40000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0
 Call Trace:
  <TASK>
  get_eprobe_size+0xb4/0x640
  ? __mod_node_page_state+0x72/0xc0
  __eprobe_trace_func+0x59/0x1a0
  ? __mod_lruvec_page_state+0xaa/0x1b0
  ? page_remove_file_rmap+0x14/0x230
  ? page_remove_rmap+0xda/0x170
  event_triggers_call+0x52/0xe0
  trace_event_buffer_commit+0x18f/0x240
  trace_event_raw_event_sched_wakeup_template+0x7a/0xb0
  try_to_wake_up+0x260/0x4c0
  __wake_up_common+0x80/0x180
  __wake_up_common_lock+0x7c/0xc0
  do_notify_parent+0x1c9/0x2a0
  exit_notify+0x1a9/0x220
  do_exit+0x2ba/0x450
  do_group_exit+0x2d/0x90
  __x64_sys_exit_group+0x14/0x20
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Obviously this is not the desired result.

Move the testing for TPARG_FL_TPOINT which is only used for event probes
to the top of the "$" variable check, as all the other variables are not
used for event probes. Also add a check in the register parsing "%" to
fail if an event probe is used.

Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Yang Jihong
c3b0f72e80 ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
ftrace_startup does not remove ops from ftrace_ops_list when
ftrace_startup_enable fails:

register_ftrace_function
  ftrace_startup
    __register_ftrace_function
      ...
      add_ftrace_ops(&ftrace_ops_list, ops)
      ...
    ...
    ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
    ...
  return 0 // ops is in the ftrace_ops_list.

When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything:
unregister_ftrace_function
  ftrace_shutdown
    if (unlikely(ftrace_disabled))
            return -ENODEV;  // return here, __unregister_ftrace_function is not executed,
                             // as a result, ops is still in the ftrace_ops_list
    __unregister_ftrace_function
    ...

If ops is dynamically allocated, it will be free later, in this case,
is_ftrace_trampoline accesses NULL pointer:

is_ftrace_trampoline
  ftrace_ops_trampoline
    do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!

Syzkaller reports as follows:
[ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b
[ 1203.508039] #PF: supervisor read access in kernel mode
[ 1203.508798] #PF: error_code(0x0000) - not-present page
[ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0
[ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI
[ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G    B   W         5.10.0 #8
[ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0
[ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00
[ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246
[ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866
[ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b
[ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07
[ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399
[ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008
[ 1203.525634] FS:  00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000
[ 1203.526801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0
[ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Therefore, when ftrace_startup_enable fails, we need to rollback registration
process and remove ops from ftrace_ops_list.

Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Steven Rostedt (Google)
7249921d94 tracing/perf: Fix double put of trace event when init fails
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
will call perf_trace_event_unreg() which will not only unregister the perf
trace event, but will also call the put() function of the tp_event.

The problem here is that the trace_event_try_get_ref() is called by the
caller of perf_trace_event_init() and if perf_trace_event_init() returns a
failure, it will then call trace_event_put(). But since the
perf_trace_event_unreg() already called the trace_event_put() function, it
triggers a WARN_ON().

 WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20

If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
the perf_trace_event_unreg() should not be calling trace_event_put(). This
breaks symmetry and causes bugs like these.

Pull out the trace_event_put() from perf_trace_event_unreg() and call it
in the locations that perf_trace_event_unreg() is called. This not only
fixes this bug, but also brings back the proper symmetry of the reg/unreg
vs get/put logic.

Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 1d18538e6a ("tracing: Have dynamic events have a ref counter")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Lukas Bulwahn
d8a64313c1 tracing: React to error return from traceprobe_parse_event_name()
The function traceprobe_parse_event_name() may set the first two function
arguments to a non-null value and still return -EINVAL to indicate an
unsuccessful completion of the function. Hence, it is not sufficient to
just check the result of the two function arguments for being not null,
but the return value also needs to be checked.

Commit 95c104c378 ("tracing: Auto generate event name when creating a
group of events") changed the error-return-value checking of the second
traceprobe_parse_event_name() invocation in __trace_eprobe_create() and
removed checking the return value to jump to the error handling case.

Reinstate using the return value in the error-return-value checking.

Link: https://lkml.kernel.org/r/20220811071734.20700-1-lukas.bulwahn@gmail.com

Fixes: 95c104c378 ("tracing: Auto generate event name when creating a group of events")
Acked-by: Linyu Yuan <quic_linyyuan@quicinc.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Linus Torvalds
4c2d0b039c Including fixes from netfilter.
Current release - regressions:
 
  - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF
    socket maps get data out of the TCP stack)
 
  - tls: rx: react to strparser initialization errors
 
  - netfilter: nf_tables: fix scheduling-while-atomic splat
 
  - net: fix suspicious RCU usage in bpf_sk_reuseport_detach()
 
 Current release - new code bugs:
 
  - mlxsw: ptp: fix a couple of races, static checker warnings
    and error handling
 
 Previous releases - regressions:
 
  - netfilter:
    - nf_tables: fix possible module reference underflow in error path
    - make conntrack helpers deal with BIG TCP (skbs > 64kB)
    - nfnetlink: re-enable conntrack expectation events
 
  - net: fix potential refcount leak in ndisc_router_discovery()
 
 Previous releases - always broken:
 
  - sched: cls_route: disallow handle of 0
 
  - neigh: fix possible local DoS due to net iface start/stop loop
 
  - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg
 
  - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu
 
  - virtio_net: fix endian-ness for RSS
 
  - dsa: mv88e6060: prevent crash on an unused port
 
  - fec: fix timer capture timing in `fec_ptp_enable_pps()`
 
  - ocelot: stats: fix races, integer wrapping and reading incorrect
    registers (the change of register definitions here accounts for
    bulk of the changed LoC in this PR)
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL+lGYACgkQMUZtbf5S
 IrunKw/+OfV68qJ2C+zg/qPgZg5XAD/v+3WuQo9Vsj4Z+dmxelyQkKqok61xLc6t
 eXr8v3/stDM1/zxHqCc0zJZMGhOug4RLS6kfVVwNbo6XaceTJlKcFTgM1bjQgLyT
 pMlet2JMhzpmWkMma2oztsG4zQaWSITCCjgLJByUmeO8+zKXDMojc1eew2bH8ueo
 KzZjIys+lHdEIo2uhGEU3OdhqnFn2zdVGVxcmtgtV3N9rIobnHiJdVwqLlTgnTvQ
 nU5ZoYUM4h1AG7gKSXsDbM0CPH3s4xavpkA3rMB1x4ahfxNd3y6WmpVt9qjE5wME
 8HbzutQ+x7Xf2XAQBBZma/KjmLW0GCHlQhRT+RHBryk21Yizb04HqXNMB1sPFZe6
 uDAvSZjZqPX+3aMznLTzz1T+F1TJygoeVNQ2tlxHkMuPrfS9g3T+jiohGnELF8+K
 /A3g7oCQin/qiMk35JXBuhGk4RqjyPsITOwAZ2OycHZWD/U5xd1OlkKPGUoUAg+m
 y+7XswZZJ/uBw+U+16AMMzg8vxCmoBHbgYGvnw0+96wpv4yVqTW26Wtzv01gjZPp
 wZuJkd+sHZLBNP5RkBC0PQj5rfcUj+4PUTXtW+57z+XM0HcmcqsXZHLXpMr4rS0b
 EnSsuDlfp9SWwfpMld75v/LA19a6opi6novjY4Nds3+t22ffEHY=
 =ednY
 -----END PGP SIGNATURE-----

Merge tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF
     socket maps get data out of the TCP stack)

   - tls: rx: react to strparser initialization errors

   - netfilter: nf_tables: fix scheduling-while-atomic splat

   - net: fix suspicious RCU usage in bpf_sk_reuseport_detach()

  Current release - new code bugs:

   - mlxsw: ptp: fix a couple of races, static checker warnings and
     error handling

  Previous releases - regressions:

   - netfilter:
      - nf_tables: fix possible module reference underflow in error path
      - make conntrack helpers deal with BIG TCP (skbs > 64kB)
      - nfnetlink: re-enable conntrack expectation events

   - net: fix potential refcount leak in ndisc_router_discovery()

  Previous releases - always broken:

   - sched: cls_route: disallow handle of 0

   - neigh: fix possible local DoS due to net iface start/stop loop

   - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg

   - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu

   - virtio_net: fix endian-ness for RSS

   - dsa: mv88e6060: prevent crash on an unused port

   - fec: fix timer capture timing in `fec_ptp_enable_pps()`

   - ocelot: stats: fix races, integer wrapping and reading incorrect
     registers (the change of register definitions here accounts for
     bulk of the changed LoC in this PR)"

* tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
  net: moxa: MAC address reading, generating, validity checking
  tcp: handle pure FIN case correctly
  tcp: refactor tcp_read_skb() a bit
  tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
  tcp: fix sock skb accounting in tcp_read_skb()
  igb: Add lock to avoid data race
  dt-bindings: Fix incorrect "the the" corrections
  net: genl: fix error path memory leak in policy dumping
  stmmac: intel: Add a missing clk_disable_unprepare() call in intel_eth_pci_remove()
  net: ethernet: mtk_eth_soc: fix possible NULL pointer dereference in mtk_xdp_run
  net/mlx5e: Allocate flow steering storage during uplink initialization
  net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats
  net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset
  net: mscc: ocelot: make struct ocelot_stat_layout array indexable
  net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work
  net: mscc: ocelot: turn stats_lock into a spinlock
  net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter
  net: mscc: ocelot: fix incorrect ndo_get_stats64 packet counters
  net: dsa: felix: fix ethtool 256-511 and 512-1023 TX packet counters
  net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it
  ...
2022-08-18 19:37:15 -07:00
David Howells
fc4aaf9fb3 net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.

Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
that checks to see if sk->sk_callback_lock() is held and use that here
instead.

Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.

Without this, the following warning can be occasionally observed:

=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
 #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
 #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4

stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4c/0x5f
 bpf_sk_reuseport_detach+0x6d/0xa4
 reuseport_detach_sock+0x75/0xdd
 inet_unhash+0xa5/0x1c0
 tcp_set_state+0x169/0x20f
 ? lockdep_sock_is_held+0x3a/0x3a
 ? __lock_release.isra.0+0x13e/0x220
 ? reacquire_held_locks+0x1bb/0x1bb
 ? hlock_class+0x31/0x96
 ? mark_lock+0x9e/0x1af
 __tcp_close+0x50/0x4b6
 tcp_close+0x28/0x70
 inet_release+0x8e/0xa7
 __sock_release+0x95/0x121
 sock_close+0x14/0x17
 __fput+0x20f/0x36a
 task_work_run+0xa3/0xcc
 exit_to_user_mode_prepare+0x9c/0x14d
 syscall_exit_to_user_mode+0x18/0x44
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: cf8c1e9672 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17 16:42:59 -07:00
Linus Torvalds
5d6a0f4da9 xen: branch for v6.0-rc1b
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYvi0yQAKCRCAXGG7T9hj
 vmikAQDWSrcWuxDkGnzut0A1tBQRUCWDMyKPqigWAA5tH2sPgAEAtWfBvT1xyl7T
 gZ22I7o21WxxDGyvNUcA65pK7c2cpg8=
 =UMbq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull more xen updates from Juergen Gross:

 - fix the handling of the "persistent grants" feature negotiation
   between Xen blkfront and Xen blkback drivers

 - a cleanup of xen.config and adding xen.config to Xen section in
   MAINTAINERS

 - support HVMOP_set_evtchn_upcall_vector, which is more compliant to
   "normal" interrupt handling than the global callback used up to now

 - further small cleanups

* tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  MAINTAINERS: add xen config fragments to XEN HYPERVISOR sections
  xen: remove XEN_SCRUB_PAGES in xen.config
  xen/pciback: Fix comment typo
  xen/xenbus: fix return type in xenbus_file_read()
  xen-blkfront: Apply 'feature_persistent' parameter when connect
  xen-blkback: Apply 'feature_persistent' parameter when connect
  xen-blkback: fix persistent grants negotiation
  x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
2022-08-14 09:28:54 -07:00
Linus Torvalds
f6eb0fed6a Misc timer fixes:
- fix a potential use-after-free bug in posix timers
  - correct a prototype
  - address a build warning
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmL3epQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPZw/+I/9GXcf3SzbG5M6Nf21SJpSjC4hAHHgb
 eyv5MUNxKvCHU5iT2SrCvgKjESl5I/E70kubeRHJnvarBPUzGnHHzGlYIYOaJPQ7
 irJpUj/6R8ps4UsMBJ8vj5f3b7163zhBJVP8egDW6roT1HUrYTFeIjIli/SOCxpY
 H1/DqHlbEALE5o5xykg3zuqAbywym+hNRleIVls4wqjZNnfqiTElSuW9xqw9xt3n
 9xYmOKZaztdv5Lp2JCm7QOu2byGzeHje72ppsDcBZ3EBvHUBLSndhfe5NQUGhtxy
 UlBqAELA653uPgPnNKLRMqt/kop8emHqvAx8T0RawPwoUS6XGDVxRX+my8+HKklg
 P8KsM/8W7+3KTHz0bf72DEHTFiXCzlswRzdOSvP5bR4xw1G4ychzvuxAiPDFR3zT
 v7uPgykxxCrEexVCBCdPmrl4WikwLJtcrSXtJ4bsisxQFlq7WWd2/osZkTffI3pN
 IIxDXuHFHC78lrUMk2OQ+ITBz01z4nCFSlgMGZ6ZY6ppS1Rndy1HG/B2NgjW1zGP
 Y/1xq/nWaql0QO7RmyoJXt1ZSMJYCyKFocRDh9nBmtBSlYm3A8aIA8b4i1VRRG1G
 8HOkdS8ef2eOWj8wqk0NvoTbiGjV7YM5pf0g1dmRLA+aGCBD1P9/iFcBv5b6Uxaq
 qZ7ZtuQzsyc=
 =Plg8
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Misc timer fixes:

   - fix a potential use-after-free bug in posix timers

   - correct a prototype

   - address a build warning"

* tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Cleanup CPU timers before freeing them during exec
  time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64
  posix-timers: Make do_clock_gettime() static
2022-08-13 14:38:22 -07:00
Linus Torvalds
1da8cf961b io_uring-6.0-2022-08-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL3+fQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmXyEACfERdKYdZ/W3IvPoyK8CJ3p7f/6SOj2/p1
 DTuaa3l7/kVq2HcRUGgZwvgeWpOCFghdBm5co/4hGqSw7bT8rERGDelo41ohhTfr
 xKIiwJflK/s280VXLJFA+o7Jeoj1oTFYCmdUmU3wcKFVnQdu1rz9s0L6bwsEqq93
 y1uty96dxYZn2mENLbBah0x9yV0h2ZxRkguUm0sdnKl/tMkUVLSD1TPLHf2s6eAL
 o3Dbmo9jv4HFXoJj8YL50Oxl22zIKBHl9hZqHdLcKesFgyFTChckKUNijWyPL2vE
 zesbnd57sXgY6ghi4LDGeCOtN41WNjiVeAm/c4XK5oFhTag8Q2x0D1hTPUByHksl
 IV/116xs6pHTeZRhNlMOBVMZGLSz95zSuRUyTONAmKgc/b3if/w3zTi1W3CnJSlx
 7O5GpqQDZTQuin0jldNKImbx1aPAATb+UWDkl7O5aXkjw4FUtxT5GrYcBNswVuKX
 iybx8NyVn8kFD1hix3U8huBOPSg1JMkR+sFml+NqYRd4i2CwV8KAPPuzsPw6MRBL
 U4DfkAkpsbKqSK+mri5aUrYxmpYkJ45mgyldiewiOso9+AYg9DDp3D2iGgAiRbKm
 i3pz1Gh/3iUow0UAI5ZFlDhjHgWPlIH7IBbemivhjhFV4GrXJqTwUzsA1iDKTe14
 3lHKkAPVPA==
 =FfLf
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Regression fix for this merge window, fixing a wrong order of
   arguments for io_req_set_res() for passthru (Dylan)

 - Fix for the audit code leaking context memory (Peilin)

 - Ensure that provided buffers are memcg accounted (Pavel)

 - Correctly handle short zero-copy sends (Pavel)

 - Sparse warning fixes for the recvmsg multishot command (Dylan)

 - Error handling fix for passthru (Anuj)

 - Remove randomization of struct kiocb fields, to avoid it growing in
   size if re-arranged in such a fashion that it grows more holes or
   padding (Keith, Linus)

 - Small series improving type safety of the sqe fields (Stefan)

* tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block:
  io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields
  io_uring: make io_kiocb_to_cmd() typesafe
  fs: don't randomize struct kiocb fields
  io_uring: consistently make use of io_notif_to_data()
  io_uring: fix error handling for io_uring_cmd
  io_uring: fix io_recvmsg_prep_multishot sparse warnings
  io_uring/net: send retry for zerocopy
  io_uring: mem-account pbuf buckets
  audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker()
  io_uring: pass correct parameters to io_req_set_res
2022-08-13 13:28:54 -07:00
Lukas Bulwahn
aa6d1e5b50 xen: remove XEN_SCRUB_PAGES in xen.config
Commit 197ecb3802 ("xen/balloon: add runtime control for scrubbing
ballooned out pages") changed config XEN_SCRUB_PAGES to config
XEN_SCRUB_PAGES_DEFAULT. As xen.config sets 'XEN_BALLOON=y' and
XEN_SCRUB_PAGES_DEFAULT defaults to yes, there is no further need to set
this config in the xen.config file.

Remove setting XEN_SCRUB_PAGES in xen.config, which is without
effect since the commit above anyway.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220810050712.9539-3-lukas.bulwahn@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-08-12 12:22:23 +02:00
Linus Torvalds
7ebfc85e2c Including fixes from bluetooth, bpf, can and netfilter.
A little longer PR than usual but it's all fixes, no late features.
 It's long partially because of timing, and partially because of
 follow ups to stuff that got merged a week or so before the merge
 window and wasn't as widely tested. Maybe the Bluetooth fixes are
 a little alarming so we'll address that, but the rest seems okay
 and not scary.
 
 Notably we're including a fix for the netfilter Kconfig [1], your
 WiFi warning [2] and a bluetooth fix which should unblock syzbot [3].
 
 Current release - regressions:
 
  - Bluetooth:
    - don't try to cancel uninitialized works [3]
    - L2CAP: fix use-after-free caused by l2cap_chan_put
 
  - tls: rx: fix device offload after recent rework
 
  - devlink: fix UAF on failed reload and leftover locks in mlxsw
 
 Current release - new code bugs:
 
  - netfilter:
    - flowtable: fix incorrect Kconfig dependencies [1]
    - nf_tables: fix crash when nf_trace is enabled
 
  - bpf:
    - use proper target btf when exporting attach_btf_obj_id
    - arm64: fixes for bpf trampoline support
 
  - Bluetooth:
    - ISO: unlock on error path in iso_sock_setsockopt()
    - ISO: fix info leak in iso_sock_getsockopt()
    - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
    - ISO: fix memory corruption on iso_pinfo.base
    - ISO: fix not using the correct QoS
    - hci_conn: fix updating ISO QoS PHY
 
  - phy: dp83867: fix get nvmem cell fail
 
 Previous releases - regressions:
 
  - wifi: cfg80211: fix validating BSS pointers in
    __cfg80211_connect_result [2]
 
  - atm: bring back zatm uAPI after ATM had been removed
 
  - properly fix old bug making bonding ARP monitor mode not being
    able to work with software devices with lockless Tx
 
  - tap: fix null-deref on skb->dev in dev_parse_header_protocol
 
  - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps
    some devices and breaks others
 
  - netfilter:
    - nf_tables: many fixes rejecting cross-object linking
      which may lead to UAFs
    - nf_tables: fix null deref due to zeroed list head
    - nf_tables: validate variable length element extension
 
  - bgmac: fix a BUG triggered by wrong bytes_compl
 
  - bcmgenet: indicate MAC is in charge of PHY PM
 
 Previous releases - always broken:
 
  - bpf:
    - fix bad pointer deref in bpf_sys_bpf() injected via test infra
    - disallow non-builtin bpf programs calling the prog_run command
    - don't reinit map value in prealloc_lru_pop
    - fix UAFs during the read of map iterator fd
    - fix invalidity check for values in sk local storage map
    - reject sleepable program for non-resched map iterator
 
  - mptcp:
    - move subflow cleanup in mptcp_destroy_common()
    - do not queue data on closed subflows
 
  - virtio_net: fix memory leak inside XDP_TX with mergeable
 
  - vsock: fix memory leak when multiple threads try to connect()
 
  - rework sk_user_data sharing to prevent psock leaks
 
  - geneve: fix TOS inheriting for ipv4
 
  - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel
 
  - phy: c45 baset1: do not skip aneg configuration if clock role
    is not specified
 
  - rose: avoid overflow when /proc displays timer information
 
  - x25: fix call timeouts in blocking connects
 
  - can: mcp251x: fix race condition on receive interrupt
 
  - can: j1939:
    - replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
    - fix memory leak of skbs in j1939_session_destroy()
 
 Misc:
 
  - docs: bpf: clarify that many things are not uAPI
 
  - seg6: initialize induction variable to first valid array index
    (to silence clang vs objtool warning)
 
  - can: ems_usb: fix clang 14's -Wunaligned-access warning
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL1TtkACgkQMUZtbf5S
 Iruz8Q/+O5xFFsjxuyZD0Mw9d3Jeo3ZI9PeeDvcYl5dZXVegpxqorujTFntxv1Ad
 JC8o5qqms3kO51d+W/yai6iDacEHX2YcJrupZve+vGvpOEVmBRY5O0E1AckJ18+u
 ItmjSVESkybUP5P08/An7Y0dMmj9Xb2z84dGkLe+n8lg6/fimo6Ki6yZjcOBOALu
 AYquMXUcnwztRMbTFjscbJjBd4xFMKZEtthljYtPdIReIN976wmMNYYx+jcPK7ha
 g39Kv6maklp4euerkGIJ/AMnOWHaOGCFjIaz7rr4444NDfrKdt/jeirUXJaz77Jo
 TJM2UOwgOeg6WZkSa3cmdq6UdjdkJ6LTe2CJFf1wJ1qfhAi+s8yWoszsM2Enp+66
 c/mo9jTCMAjmgEJF11idZuz2S697/5j0hvbfM3ZPgNyNBgn8qxz/Z56fNOisx95u
 TkoKKFnGH+mcm/et+omBcyLBtBVK2+/6B6mpl6btf4DOkPn5KFYWHV67uV3ksHzQ
 ye+pnzidoIG0yKbRM2EQKXk7ELKROpl52xUHko93ZinMJt0Q7jBm7tZhJozNFEzi
 hWgUvpmNXgawzLYQcJ9jJmKw3PmYZnRhvYZB/1r91YamM28Hd58k9WfpWtUtjYJN
 N0X58L6JSnKPqzR70pcFppz6iBlh0tHdcEQGWhhKU5ScS3FDxGc=
 =C5Ck
 -----END PGP SIGNATURE-----

Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, bpf, can and netfilter.

  A little larger than usual but it's all fixes, no late features. It's
  large partially because of timing, and partially because of follow ups
  to stuff that got merged a week or so before the merge window and
  wasn't as widely tested. Maybe the Bluetooth fixes are a little
  alarming so we'll address that, but the rest seems okay and not scary.

  Notably we're including a fix for the netfilter Kconfig [1], your WiFi
  warning [2] and a bluetooth fix which should unblock syzbot [3].

  Current release - regressions:

   - Bluetooth:
      - don't try to cancel uninitialized works [3]
      - L2CAP: fix use-after-free caused by l2cap_chan_put

   - tls: rx: fix device offload after recent rework

   - devlink: fix UAF on failed reload and leftover locks in mlxsw

  Current release - new code bugs:

   - netfilter:
      - flowtable: fix incorrect Kconfig dependencies [1]
      - nf_tables: fix crash when nf_trace is enabled

   - bpf:
      - use proper target btf when exporting attach_btf_obj_id
      - arm64: fixes for bpf trampoline support

   - Bluetooth:
      - ISO: unlock on error path in iso_sock_setsockopt()
      - ISO: fix info leak in iso_sock_getsockopt()
      - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
      - ISO: fix memory corruption on iso_pinfo.base
      - ISO: fix not using the correct QoS
      - hci_conn: fix updating ISO QoS PHY

   - phy: dp83867: fix get nvmem cell fail

  Previous releases - regressions:

   - wifi: cfg80211: fix validating BSS pointers in
     __cfg80211_connect_result [2]

   - atm: bring back zatm uAPI after ATM had been removed

   - properly fix old bug making bonding ARP monitor mode not being able
     to work with software devices with lockless Tx

   - tap: fix null-deref on skb->dev in dev_parse_header_protocol

   - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some
     devices and breaks others

   - netfilter:
      - nf_tables: many fixes rejecting cross-object linking which may
        lead to UAFs
      - nf_tables: fix null deref due to zeroed list head
      - nf_tables: validate variable length element extension

   - bgmac: fix a BUG triggered by wrong bytes_compl

   - bcmgenet: indicate MAC is in charge of PHY PM

  Previous releases - always broken:

   - bpf:
      - fix bad pointer deref in bpf_sys_bpf() injected via test infra
      - disallow non-builtin bpf programs calling the prog_run command
      - don't reinit map value in prealloc_lru_pop
      - fix UAFs during the read of map iterator fd
      - fix invalidity check for values in sk local storage map
      - reject sleepable program for non-resched map iterator

   - mptcp:
      - move subflow cleanup in mptcp_destroy_common()
      - do not queue data on closed subflows

   - virtio_net: fix memory leak inside XDP_TX with mergeable

   - vsock: fix memory leak when multiple threads try to connect()

   - rework sk_user_data sharing to prevent psock leaks

   - geneve: fix TOS inheriting for ipv4

   - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel

   - phy: c45 baset1: do not skip aneg configuration if clock role is
     not specified

   - rose: avoid overflow when /proc displays timer information

   - x25: fix call timeouts in blocking connects

   - can: mcp251x: fix race condition on receive interrupt

   - can: j1939:
      - replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
      - fix memory leak of skbs in j1939_session_destroy()

  Misc:

   - docs: bpf: clarify that many things are not uAPI

   - seg6: initialize induction variable to first valid array index (to
     silence clang vs objtool warning)

   - can: ems_usb: fix clang 14's -Wunaligned-access warning"

* tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits)
  net: atm: bring back zatm uAPI
  dpaa2-eth: trace the allocated address instead of page struct
  net: add missing kdoc for struct genl_multicast_group::flags
  nfp: fix use-after-free in area_cache_get()
  MAINTAINERS: use my korg address for mt7601u
  mlxsw: minimal: Fix deadlock in ports creation
  bonding: fix reference count leak in balance-alb mode
  net: usb: qmi_wwan: Add support for Cinterion MV32
  bpf: Shut up kern_sys_bpf warning.
  net/tls: Use RCU API to access tls_ctx->netdev
  tls: rx: device: don't try to copy too much on detach
  tls: rx: device: bound the frag walk
  net_sched: cls_route: remove from list when handle is 0
  selftests: forwarding: Fix failing tests with old libnet
  net: refactor bpf_sk_reuseport_detach()
  net: fix refcount bug in sk_psock_get (2)
  selftests/bpf: Ensure sleepable program is rejected by hash map iter
  selftests/bpf: Add write tests for sk local storage map iterator
  selftests/bpf: Add tests for reading a dangling map iter fd
  bpf: Only allow sleepable program for resched-able iterator
  ...
2022-08-11 13:45:37 -07:00
Alexei Starovoitov
4e4588f1c4 bpf: Shut up kern_sys_bpf warning.
Shut up this warning:
kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-10 23:58:13 -07:00
Jakub Kicinski
fbe8870f72 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
bpf 2022-08-10

We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).

The main changes are:

1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.

2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.

3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.

4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.

5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.

6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.

7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.

8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.

9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
   offset buffer, from Aijun Sun.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  selftests/bpf: Ensure sleepable program is rejected by hash map iter
  selftests/bpf: Add write tests for sk local storage map iterator
  selftests/bpf: Add tests for reading a dangling map iter fd
  bpf: Only allow sleepable program for resched-able iterator
  bpf: Check the validity of max_rdwr_access for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
  bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for hash map iterator
  bpf: Acquire map uref in .init_seq_private for array map iterator
  bpf: Disallow bpf programs call prog_run command.
  bpf, arm64: Fix bpf trampoline instruction endianness
  selftests/bpf: Add test for prealloc_lru_pop bug
  bpf: Don't reinit map value in prealloc_lru_pop
  bpf: Allow calling bpf_prog_test kfuncs in tracing programs
  bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
  selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
  bpf: Use proper target btf when exporting attach_btf_obj_id
  mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
  bpf: Cleanup ftrace hash in bpf_trampoline_put
  BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
  ...
====================

Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-10 21:48:15 -07:00
Hawkins Jiawei
cf8c1e9672 net: refactor bpf_sk_reuseport_detach()
Refactor sk_user_data dereference using more generic function
__rcu_dereference_sk_user_data_with_flags(), which improve its
maintainability

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-10 21:48:04 -07:00
Linus Torvalds
c235698355 cxl for 6.0
- Introduce a 'struct cxl_region' object with support for provisioning
   and assembling persistent memory regions.
 
 - Introduce alloc_free_mem_region() to accompany the existing
   request_free_mem_region() as a method to allocate physical memory
   capacity out of an existing resource.
 
 - Export insert_resource_expand_to_fit() for the CXL subsystem to
   late-publish CXL platform windows in iomem_resource.
 
 - Add a polled mode PCI DOE (Data Object Exchange) driver service and
   use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
   Table).
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYvLYmAAKCRDfioYZHlFs
 Z0pbAQC/3j+WriWpU7CdhrnZI1Wqn+x5IIklF0Lc4/f6LwGZtAEAsSbLpItzvwqx
 M/rcLaeLpwYlgvS1JjdsuQ2VQ7KOtAs=
 =ehNT
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
 "Compute Express Link (CXL) updates for 6.0:

   - Introduce a 'struct cxl_region' object with support for
     provisioning and assembling persistent memory regions.

   - Introduce alloc_free_mem_region() to accompany the existing
     request_free_mem_region() as a method to allocate physical memory
     capacity out of an existing resource.

   - Export insert_resource_expand_to_fit() for the CXL subsystem to
     late-publish CXL platform windows in iomem_resource.

   - Add a polled mode PCI DOE (Data Object Exchange) driver service and
     use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
     Table)"

* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
  cxl/hdm: Fix skip allocations vs multiple pmem allocations
  cxl/region: Disallow region granularity != window granularity
  cxl/region: Fix x1 interleave to greater than x1 interleave routing
  cxl/region: Move HPA setup to cxl_region_attach()
  cxl/region: Fix decoder interleave programming
  Documentation: cxl: remove dangling kernel-doc reference
  cxl/region: describe targets and nr_targets members of cxl_region_params
  cxl/regions: add padding for cxl_rr_ep_add nested lists
  cxl/region: Fix IS_ERR() vs NULL check
  cxl/region: Fix region reference target accounting
  cxl/region: Fix region commit uninitialized variable warning
  cxl/region: Fix port setup uninitialized variable warnings
  cxl/region: Stop initializing interleave granularity
  cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
  cxl/acpi: Minimize granularity for x1 interleaves
  cxl/region: Delete 'region' attribute from root decoders
  cxl/acpi: Autoload driver for 'cxl_acpi' test devices
  cxl/region: decrement ->nr_targets on error in cxl_region_attach()
  cxl/region: prevent underflow in ways_to_cxl()
  cxl/region: uninitialized variable in alloc_hpa()
  ...
2022-08-10 11:07:26 -07:00
Hou Tao
d247049f4f bpf: Only allow sleepable program for resched-able iterator
When a sleepable program is attached to a hash map iterator, might_fault()
will report "BUG: sleeping function called from invalid context..." if
CONFIG_DEBUG_ATOMIC_SLEEP is enabled. The reason is that rcu_read_lock()
is held in bpf_hash_map_seq_next() and won't be released until all elements
are traversed or bpf_hash_map_seq_stop() is called.

Fixing it by reusing BPF_ITER_RESCHED to indicate that only non-sleepable
program is allowed for iterator without BPF_ITER_RESCHED. We can revise
bpf_iter_link_attach() later if there are other conditions which may
cause rcu_read_lock() or spin_lock() issues.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-10 10:12:48 -07:00
Hou Tao
ef1e93d2ee bpf: Acquire map uref in .init_seq_private for hash map iterator
bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

So acquiring an extra map uref in bpf_iter_init_hash_map() and
releasing it in bpf_iter_fini_hash_map().

Fixes: d6c4503cc2 ("bpf: Implement bpf iterator for hash maps")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-10 10:12:47 -07:00
Hou Tao
f76fa6b338 bpf: Acquire map uref in .init_seq_private for array map iterator
bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

Alternative fix is acquiring an extra bpf_link reference just like
a pinned map iterator does, but it introduces unnecessary dependency
on bpf_link instead of bpf_map.

So choose another fix: acquiring an extra map uref in .init_seq_private
for array map iterator.

Fixes: d3cc2ab546 ("bpf: Implement bpf iterator for array maps")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-10 10:12:47 -07:00
Alexei Starovoitov
86f44fcec2 bpf: Disallow bpf programs call prog_run command.
The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in
pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN
command from within the program.
To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf()
kernel function that can only be used by the kernel light skeleton directly.

Reported-by: YiFei Zhu <zhuyifei@google.com>
Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-10 09:43:07 -07:00
Kumar Kartikeya Dwivedi
275c30bcee bpf: Don't reinit map value in prealloc_lru_pop
The LRU map that is preallocated may have its elements reused while
another program holds a pointer to it from bpf_map_lookup_elem. Hence,
only check_and_free_fields is appropriate when the element is being
deleted, as it ensures proper synchronization against concurrent access
of the map value. After that, we cannot call check_and_init_map_value
again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
they can be concurrently accessed from a BPF program.

This is safe to do as when the map entry is deleted, concurrent access
is protected against by check_and_free_fields, i.e. an existing timer
would be freed, and any existing kptr will be released by it. The
program can create further timers and kptrs after check_and_free_fields,
but they will eventually be released once the preallocated items are
freed on map destruction, even if the item is never reused again. Hence,
the deleted item sitting in the free list can still have resources
attached to it, and they would never leak.

With spin_lock, we never touch the field at all on delete or update, as
we may end up modifying the state of the lock. Since the verifier
ensures that a bpf_spin_lock call is always paired with bpf_spin_unlock
call, the program will eventually release the lock so that on reuse the
new user of the value can take the lock.

Essentially, for the preallocated case, we must assume that the map
value may always be in use by the program, even when it is sitting in
the freelist, and handle things accordingly, i.e. use proper
synchronization inside check_and_free_fields, and never reinitialize the
special fields when it is reused on update.

Fixes: 68134668c1 ("bpf: Add map side support for bpf timers.")
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220809213033.24147-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-09 18:46:11 -07:00
Youngmin Nam
46dae32fe6 time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64
In ns_to_kernel_old_timeval() definition, the function argument is defined
with const identifier in kernel/time/time.c, but the prototype in
include/linux/time32.h looks different.

- The function is defined in kernel/time/time.c as below:
  struct __kernel_old_timeval ns_to_kernel_old_timeval(const s64 nsec)

- The function is decalared in include/linux/time32.h as below:
  extern struct __kernel_old_timeval ns_to_kernel_old_timeval(s64 nsec);

Because the variable of arithmethic types isn't modified in the calling scope,
there's no need to mark arguments as const, which was already mentioned during 
review (Link[1) of the original patch.

Likewise remove the "const" keyword in both definition and declaration of
ns_to_timespec64() as requested by Arnd (Link[2]).

Fixes: a84d116916 ("y2038: Introduce struct __kernel_old_timeval")
Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/all/20220712094715.2918823-1-youngmin.nam@samsung.com
Link[1]: https://lore.kernel.org/all/20180310081123.thin6wphgk7tongy@gmail.com/
Link[2]: https://lore.kernel.org/all/CAK8P3a3nknJgEDESGdJH91jMj6R_xydFqWASd8r5BbesdvMBgA@mail.gmail.com/
2022-08-09 20:02:13 +02:00
Linus Torvalds
5d5d353bed remoteproc updates for v5.20
This introduces support for the remoteproc on Mediatek MT8188, and
 enables caches for MT8186 SCP. It adds support for PRU cores found on
 the TI K3 AM62x SoCs.
 
 It moves the recovery work after a firmware crash to an unbound
 workqueue, to allow recovery to happen in parallel.
 
 A new DMA API is introduced to release dma_mem for a device.
 
 It adds support a panic handler for the Qualcomm modem remoteproc, with
 the goal of having caches flushed in memory dumps for post-mortem
 debugging and it introduces a mechanism to wait for the modem firmware
 on SM8450 to decrypt part of its memory for post-mortem debugging.
 
 Qualcomm sysmon is restricted to only inform remote processors about
 peers that are actually running, to avoid a race where Linux tries to
 notify a recovering remote processor about its peers new state. A
 mechanism for waiting for the sysmon connection to be established is also
 introduced, to avoid out-of-sync updates for rapidly restarting remote
 processors.
 
 A number of Devicetree binding cleanups and conversions to YAML are
 introduced, to facilitate Devicetree validation. Lastly it introduces a
 number of smaller fixes and cleanups in the core and a few different
 drivers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmLxXTUbHGJqb3JuLmFu
 ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3F6lcQAKEAtkd7dRChx5Y11h8J
 BdUmqYTGrlZCfZhGePgUgm9KXvf+BwjnYgZGNPVsno0h9/taY6pWggGz1/hMeD97
 oTFrzZreOEHmrB7tKCQmzKdHzlVaf1aMifzz1BkICH+TRG2t/V3ycr+KJhyCK6IV
 CcsQ6D4FRdVDTWHEizWRewO7uFzaA3CWlr7uSY99aDMXikxSSGU7TgkH8ac04TU/
 Z1+X2uClOa7IzaQX6dSm5lzZGDACatA0+WLFBf6LlEC2XtywKxPHq60QjWQwuXth
 /5mljBbIyW+5Qblm1r1gaipOCd6bGUvlY+0TdqbLlK8LpNIpDjFrt1mrmT4N2T+6
 OAEyXglFvqHG8qjDafew5SxOEYbmFCMJ/oY+akNmpKS7Hhwx3AHeiZJdtu+bDY3O
 JeMQVCqrdMbrdBTNPJEjkTnhWCu1fPTn8STGaAEHgxsOPkarEtk37DuEy6KcV4It
 RTFY4mfnJrTfNeFpm60tOxg/zGYTjXol7uqY7BUTB7bV82W5+UTVGlpO8ayHvxru
 MwtN0HIDH/liXEsbt8INATXTEiTwJmEiqga53/EEWhMtnor3/xE2e26TZwzfq3sB
 Ue8TXnuQEN+v/ThHHvjyOZH0MONivYiW6iHkAuzq0RdnHIVDrFD/YQusWpxj7uuM
 nuk9OY0SbxMvUXIFKucg7zXJ
 =gbAX
 -----END PGP SIGNATURE-----

Merge tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux

Pull remoteproc updates from Bjorn Andersson:
 "This introduces support for the remoteproc on Mediatek MT8188, and
  enables caches for MT8186 SCP. It adds support for PRU cores found on
  the TI K3 AM62x SoCs.

  It moves the recovery work after a firmware crash to an unbound
  workqueue, to allow recovery to happen in parallel.

  A new DMA API is introduced to release dma_mem for a device.

  It adds support a panic handler for the Qualcomm modem remoteproc,
  with the goal of having caches flushed in memory dumps for post-mortem
  debugging and it introduces a mechanism to wait for the modem firmware
  on SM8450 to decrypt part of its memory for post-mortem debugging.

  Qualcomm sysmon is restricted to only inform remote processors about
  peers that are actually running, to avoid a race where Linux tries to
  notify a recovering remote processor about its peers new state. A
  mechanism for waiting for the sysmon connection to be established is
  also introduced, to avoid out-of-sync updates for rapidly restarting
  remote processors.

  A number of Devicetree binding cleanups and conversions to YAML are
  introduced, to facilitate Devicetree validation. Lastly it introduces
  a number of smaller fixes and cleanups in the core and a few different
  drivers"

* tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: (42 commits)
  remoteproc: qcom_q6v5_pas: Do not fail if regulators are not found
  drivers/remoteproc: fix repeated words in comments
  remoteproc: Directly use ida_alloc()/free()
  remoteproc: Use unbounded workqueue for recovery work
  remoteproc: using pm_runtime_resume_and_get instead of pm_runtime_get_sync
  remoteproc: qcom_q6v5_pas: Deal silently with optional px and cx regulators
  remoteproc: sysmon: Send sysmon state only for running rprocs
  remoteproc: sysmon: Wait for SSCTL service to come up
  remoteproc: qcom: q6v5: Set q6 state to offline on receiving wdog irq
  remoteproc: qcom: pas: Check if coredump is enabled
  remoteproc: qcom: pas: Mark devices as wakeup capable
  remoteproc: qcom: pas: Mark va as io memory
  remoteproc: qcom: pas: Add decrypt shutdown support for modem
  remoteproc: qcom: q6v5-mss: add powerdomains to MSM8996 config
  remoteproc: qcom_q6v5: Introduce panic handler for MSS
  remoteproc: qcom_q6v5_mss: Update MBA log info
  remoteproc: qcom: correct kerneldoc
  remoteproc: qcom_q6v5_mss: map/unmap metadata region before/after use
  remoteproc: qcom: using pm_runtime_resume_and_get to simplify the code
  remoteproc: mediatek: Support MT8188 SCP
  ...
2022-08-08 15:16:29 -07:00