linux/kernel/events
Marco Elver 9b1933b864 perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets
Running the perf benchmark with the following parameters (note: more
aggressive than in the preceding changes, but the same 256-CPU host):

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.989 [sec]
 |
 |       38.854160 usecs/op
 |     4973.332500 usecs/op/cpu

    20.43%  [kernel]       [k] queued_spin_lock_slowpath
    18.75%  [kernel]       [k] osq_lock
    16.98%  [kernel]       [k] rhashtable_jhash2
     8.34%  [kernel]       [k] task_bp_pinned
     4.23%  [kernel]       [k] smp_cfm_core_cond
     3.65%  [kernel]       [k] bcmp
     2.83%  [kernel]       [k] toggle_bp_slot
     1.87%  [kernel]       [k] find_next_bit
     1.49%  [kernel]       [k] __reserve_bp_slot

We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().

Computing max_bp_pinned_slots() for CPU-independent task targets is
currently O(#cpus): the loop calls task_bp_pinned() for each CPU, even
though the result of task_bp_pinned() is CPU-independent.
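
A minimal, self-contained userspace sketch of the current shape of that
loop (task_bp_pinned() and max_bp_pinned_slots() are the functions named
above; NR_CPUS, the cpu_pinned[] array and the returned values are
simplified, hypothetical stand-ins for the real rhashtable-backed
accounting):

  #include <stdio.h>

  #define NR_CPUS 256

  /* Hypothetical stand-in for the per-CPU CPU-pinned slot counts. */
  static int cpu_pinned[NR_CPUS];

  /*
   * Stand-in for task_bp_pinned(): in the kernel this hashes the task
   * pointer to index task_bps_ht and walks the task's breakpoint list.
   * For a CPU-independent task target it returns the same value for
   * every @cpu, which is exactly the redundancy removed by this change.
   */
  static int task_bp_pinned(int cpu, const void *task)
  {
          (void)cpu;
          (void)task;
          return 2;       /* pretend the task has 2 pinned slots */
  }

  /* Pre-optimization shape: one hash lookup per CPU. */
  static int max_bp_pinned_slots_slow(const void *task)
  {
          int max_slots = 0;

          for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                  int nr = cpu_pinned[cpu] + task_bp_pinned(cpu, task);

                  if (nr > max_slots)
                          max_slots = nr;
          }
          return max_slots;
  }

  int main(void)
  {
          cpu_pinned[3] = 1;
          printf("max pinned slots: %d\n", max_bp_pinned_slots_slow(&cpu_pinned));
          return 0;
  }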

The loop in max_bp_pinned_slots() wants to compute the maximum slots
across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
obtaining the maximum CPU-pinned slots across all CPUs and adding
task_bp_pinned() once.

To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
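
A minimal sketch of the histogram idea, again as a simplified userspace
model (the name bp_slots_histogram comes from this change; MAX_SLOTS,
the layout and the helper names below are illustrative assumptions, not
the kernel's data structures): count[i] tracks how many CPUs currently
have exactly i CPU-pinned slots in use, so the per-CPU maximum can be
read off in O(#slots), constant for a given breakpoint type, instead of
O(#cpus).

  #include <assert.h>
  #include <stdio.h>

  #define MAX_SLOTS 4     /* e.g. number of HW breakpoint registers */

  /* Illustrative histogram: count[i] = #CPUs with i CPU-pinned slots. */
  struct bp_slots_histogram {
          int count[MAX_SLOTS + 1];
  };

  /* A CPU goes from old_slots to new_slots CPU-pinned breakpoints. */
  static void hist_update(struct bp_slots_histogram *h, int old_slots, int new_slots)
  {
          assert(h->count[old_slots] > 0);
          h->count[old_slots]--;
          h->count[new_slots]++;
  }

  /* Highest per-CPU CPU-pinned slot count, independent of #cpus. */
  static int hist_max(const struct bp_slots_histogram *h)
  {
          for (int i = MAX_SLOTS; i > 0; i--) {
                  if (h->count[i])
                          return i;
          }
          return 0;
  }

  /*
   * Fast path for a CPU-independent task target: its contribution is
   * the same on every CPU, so the maximum over all CPUs is simply
   * hist_max() + task_bp_pinned().
   */
  static int max_bp_pinned_slots_fast(const struct bp_slots_histogram *h,
                                      int task_pinned)
  {
          return hist_max(h) + task_pinned;
  }

  int main(void)
  {
          /* 256 idle CPUs, then one CPU gains a CPU-pinned breakpoint. */
          struct bp_slots_histogram h = { .count = { [0] = 256 } };

          hist_update(&h, 0, 1);
          printf("max pinned slots: %d\n", max_bp_pinned_slots_fast(&h, 2));
          return 0;
  }

Targets for which task_bp_pinned() is not CPU-independent would still
fall back to the per-CPU loop sketched earlier.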

After this optimization:

 | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
 |      Total time: 1.930 [sec]
 |
 |       37.697832 usecs/op
 |     4825.322500 usecs/op/cpu

    19.13%  [kernel]       [k] queued_spin_lock_slowpath
    18.21%  [kernel]       [k] rhashtable_jhash2
    15.46%  [kernel]       [k] osq_lock
     6.27%  [kernel]       [k] toggle_bp_slot
     5.91%  [kernel]       [k] task_bp_pinned
     5.05%  [kernel]       [k] smp_cfm_core_cond
     1.78%  [kernel]       [k] update_sg_lb_stats
     1.36%  [kernel]       [k] llist_reverse_order
     1.34%  [kernel]       [k] find_next_bit
     1.19%  [kernel]       [k] bcmp

This suggests that the time spent in task_bp_pinned() has been reduced.
However, we are still hashing too much; that will be addressed in the
subsequent change.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
2022-08-30 10:56:24 +02:00
callchain.c uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
core.c Misc fixes to kprobes and the faddr2line script, plus a cleanup. 2022-08-06 17:28:12 -07:00
hw_breakpoint_test.c perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test 2022-08-30 10:56:20 +02:00
hw_breakpoint.c perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets 2022-08-30 10:56:24 +02:00
internal.h perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled 2022-04-19 21:15:42 +02:00
Makefile perf/hw_breakpoint: Add KUnit test for constraints accounting 2022-08-30 10:56:20 +02:00
ring_buffer.c perf/core: Add a new read format to get a number of lost samples 2022-06-28 09:08:31 +02:00
uprobes.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00