linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-11 04:54:13 +08:00

History

Marco Elver 0912037fec perf/hw_breakpoint: Reduce contention with large number of tasks While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue: 95.93% [kernel] [k] osq_lock 0.70% [kernel] [k] mutex_spin_on_owner 0.22% [kernel] [k] smp_cfm_core_cond 0.18% [kernel] [k] task_bp_pinned 0.18% [kernel] [k] rhashtable_jhash2 0.15% [kernel] [k] queued_spin_lock_slowpath when running the breakpoint benchmark with (system with 256 CPUs): \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.207 [sec] \| \| 108.267188 usecs/op \| 6929.100000 usecs/op/cpu The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed. The access pattern is as follows: 1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused. 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked. Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows: 1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t. 2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot). We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex): 1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints. 2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer. With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile: \| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.077 [sec] \| \| 40.201563 usecs/op \| 2572.900000 usecs/op/cpu 21.54% [kernel] [k] task_bp_pinned 20.18% [kernel] [k] rhashtable_jhash2 6.81% [kernel] [k] toggle_bp_slot 5.47% [kernel] [k] queued_spin_lock_slowpath 3.75% [kernel] [k] smp_cfm_core_cond 3.48% [kernel] [k] bcmp On this particular setup that's a speedup of 2.7x. We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled: \| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 4 breakpoints and 64 parallelism \| Total time: 0.067 [sec] \| \| 35.286458 usecs/op \| 2258.333333 usecs/op/cpu Which means the current implementation is ~12% slower than the theoretical ideal. For reference, performance without any breakpoints: \| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64 \| # Running 'breakpoint/thread' benchmark: \| # Created/joined 30 threads with 0 breakpoints and 64 parallelism \| Total time: 0.060 [sec] \| \| 31.365625 usecs/op \| 2007.400000 usecs/op/cpu On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com		2022-08-30 10:56:24 +02:00
..
callchain.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
core.c	Misc fixes to kprobes and the faddr2line script, plus a cleanup.	2022-08-06 17:28:12 -07:00
hw_breakpoint_test.c	perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test	2022-08-30 10:56:20 +02:00
hw_breakpoint.c	perf/hw_breakpoint: Reduce contention with large number of tasks	2022-08-30 10:56:24 +02:00
internal.h	perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled	2022-04-19 21:15:42 +02:00
Makefile	perf/hw_breakpoint: Add KUnit test for constraints accounting	2022-08-30 10:56:20 +02:00
ring_buffer.c	perf/core: Add a new read format to get a number of lost samples	2022-06-28 09:08:31 +02:00
uprobes.c	Yang Shi has improved the behaviour of khugepaged collapsing of readonly	2022-05-26 12:32:41 -07:00