linux/kernel/events
Marco Elver 0370dc314d perf/hw_breakpoint: Optimize list of per-task breakpoints
On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within
'nr_bp_mutex' critical sections; this in turn results in contention on
that mutex as well:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.
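
For illustration, the pre-optimization lookup has roughly the following
shape (a simplified sketch, not the verbatim code; helpers such as
find_slot_idx() and hw_breakpoint_weight() exist in hw_breakpoint.c, the
rest is condensed):

  /* One global list holds *all* tasks' breakpoints. */
  static LIST_HEAD(bp_task_head);

  /* Count slots pinned by bp's target task: walks every task's entries. */
  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
  {
          struct task_struct *tsk = bp->hw.target;
          struct perf_event *iter;
          int count = 0;

          list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
                  if (iter->hw.target == tsk &&
                      find_slot_idx(iter->attr.bp_type) == type &&
                      (iter->cpu < 0 || cpu == iter->cpu))
                          count += hw_breakpoint_weight(iter);
          }

          return count;
  }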

Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().
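
A minimal sketch of the rhltable-based lookup (field and variable names
such as 'task_bps_ht' are illustrative and condensed from the patch; the
table is keyed by the breakpoint's target task, and the 'hw.bp_list' node
becomes a struct rhlist_head):

  #include <linux/rhashtable.h>

  /* All per-task breakpoints, keyed by the target task_struct pointer. */
  static struct rhltable task_bps_ht;

  static const struct rhashtable_params task_bps_ht_params = {
          .head_offset = offsetof(struct hw_perf_event, bp_list),
          .key_offset = offsetof(struct hw_perf_event, target),
          .key_len = sizeof_field(struct hw_perf_event, target),
          .automatic_shrinking = true,
  };

  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
  {
          struct rhlist_head *head, *pos;
          struct perf_event *iter;
          int count = 0;

          /*
           * rhltable_lookup() returns only the entries sharing the key,
           * i.e. this task's breakpoints -- found in O(1) on average.
           */
          rcu_read_lock();
          head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
          if (!head)
                  goto out;

          rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
                  if (find_slot_idx(iter->attr.bp_type) == type &&
                      (iter->cpu < 0 || cpu == iter->cpu))
                          count += hw_breakpoint_weight(iter);
          }
  out:
          rcu_read_unlock();
          return count;
  }

Adding and removing breakpoints (toggle_bp_slot()) would then pair this
with rhltable_insert()/rhltable_remove() on &bp->hw.bp_list using the same
params, and rhltable_init() replaces the static list initialization.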

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.
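(That factor follows from the per-op numbers above: 123134.794271 /
108.422396 ≈ 1135.7; the total time gives a similar ratio, 236.418 s /
0.208 s ≈ 1136.6.)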

While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we will reach the
theoretical ideal performance, i.e. as if there were no constraints to
check).

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
2022-08-30 10:56:21 +02:00
callchain.c uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
core.c Misc fixes to kprobes and the faddr2line script, plus a cleanup. 2022-08-06 17:28:12 -07:00
hw_breakpoint_test.c perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test 2022-08-30 10:56:20 +02:00
hw_breakpoint.c perf/hw_breakpoint: Optimize list of per-task breakpoints 2022-08-30 10:56:21 +02:00
internal.h perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled 2022-04-19 21:15:42 +02:00
Makefile perf/hw_breakpoint: Add KUnit test for constraints accounting 2022-08-30 10:56:20 +02:00
ring_buffer.c perf/core: Add a new read format to get a number of lost samples 2022-06-28 09:08:31 +02:00
uprobes.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00