linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-04 17:44:14 +08:00

Andrii Nakryiko 520fad2e32 selftests/bpf: scale benchmark counting by using per-CPU counters

When benchmarking with multiple threads (-pN, where N>1), we start
contending on a single atomic counter that both BPF trigger benchmarks
are using, as well as the "baseline" tests in user space (trig-base and
trig-uprobe-base benchmarks). As such, we start bottlenecking on
something completely irrelevant to the benchmark at hand.

Scale counting up by using per-CPU counters on the BPF side. On the user
space side we do the next best thing: hash the thread ID to approximate
per-CPU behavior. It seems to work quite well in practice.

To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
16, and 32 threads:
  - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
  - trig-base (get_pgid() syscall, atomic counter in user-space);
  - trig-fentry (syscall to trigger fentry program, atomic uncontended
    per-CPU counter on BPF side).

Command used:

  for b in uprobe-base base fentry; do \
    for p in 1 2 4 8 16 32; do \
      printf "%-11s %2d: %s\n" $b $p \
        "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
    done; \
  done

Before these changes, aggregate throughput across all threads doesn't
scale well with the number of threads; it actually even falls sharply
for uprobe-base due to very high contention:

  uprobe-base  1:  138.998 ± 0.650M/s
  uprobe-base  2:   70.526 ± 1.147M/s
  uprobe-base  4:   63.114 ± 0.302M/s
  uprobe-base  8:   54.177 ± 0.138M/s
  uprobe-base 16:   45.439 ± 0.057M/s
  uprobe-base 32:   37.163 ± 0.242M/s
  base         1:   16.940 ± 0.182M/s
  base         2:   19.231 ± 0.105M/s
  base         4:   21.479 ± 0.038M/s
  base         8:   23.030 ± 0.037M/s
  base        16:   22.034 ± 0.004M/s
  base        32:   18.152 ± 0.013M/s
  fentry       1:   14.794 ± 0.054M/s
  fentry       2:   17.341 ± 0.055M/s
  fentry       4:   23.792 ± 0.024M/s
  fentry       8:   21.557 ± 0.047M/s
  fentry      16:   21.121 ± 0.004M/s
  fentry      32:   17.067 ± 0.023M/s

After these changes, we see almost perfect linear scaling, as expected.
The sub-linear scaling when going from 8 to 16 threads is interesting
and consistent on my test machine, but I haven't investigated what is
causing this peculiar slowdown (across all benchmarks, could be due to
hyperthreading effects, not sure).

  uprobe-base  1:  139.980 ± 0.648M/s
  uprobe-base  2:  270.244 ± 0.379M/s
  uprobe-base  4:  532.044 ± 1.519M/s
  uprobe-base  8: 1004.571 ± 3.174M/s
  uprobe-base 16: 1720.098 ± 0.744M/s
  uprobe-base 32: 3506.659 ± 8.549M/s
  base         1:   16.869 ± 0.071M/s
  base         2:   33.007 ± 0.092M/s
  base         4:   64.670 ± 0.203M/s
  base         8:  121.969 ± 0.210M/s
  base        16:  207.832 ± 0.112M/s
  base        32:  424.227 ± 1.477M/s
  fentry       1:   14.777 ± 0.087M/s
  fentry       2:   28.575 ± 0.146M/s
  fentry       4:   56.234 ± 0.176M/s
  fentry       8:  106.095 ± 0.385M/s
  fentry      16:  181.440 ± 0.032M/s
  fentry      32:  369.131 ± 0.693M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2024-03-19 23:41:35 -07:00
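
The user-space half of the idea in the commit message (spread increments over several counters chosen by hashing a thread identifier, then sum them when reporting) can be sketched roughly as below. This is an illustrative sketch only, not the code from the commit: the names `NR_SLOTS`, `CACHE_LINE`, `slot_for`, `worker`, `total_hits` and `run_demo` are all hypothetical, the hash constant is just a common multiplicative-hash choice, and a small worker index stands in for the real thread ID.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS   64	/* hypothetical number of counter slots */
#define CACHE_LINE 64	/* assumed cache line size in bytes */
#define MAX_THREADS 64	/* hypothetical upper bound for this demo */

/* One counter per slot, aligned so that two slots never share a cache line. */
struct slot {
	_Alignas(CACHE_LINE) atomic_long value;
};

static struct slot slots[NR_SLOTS];

/* Map a thread identifier to a slot using a cheap multiplicative hash. */
static unsigned int slot_for(uint32_t tid)
{
	return ((tid * 2654435761u) >> 16) % NR_SLOTS;
}

struct worker_arg {
	uint32_t id;	/* stand-in for a real thread ID (assumption) */
	long iters;
};

/* Each worker bumps only "its" slot. Two workers may hash to the same
 * slot; the atomic add keeps the count correct in that case, it just
 * reintroduces some contention. */
static void *worker(void *p)
{
	const struct worker_arg *arg = p;
	atomic_long *ctr = &slots[slot_for(arg->id)].value;

	for (long i = 0; i < arg->iters; i++)
		atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);
	return NULL;
}

/* Reporting path: the only place where all slots are read together. */
static long total_hits(void)
{
	long sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += atomic_load_explicit(&slots[i].value, memory_order_relaxed);
	return sum;
}

/* Reset the slots, run nthreads workers, and return the aggregate count. */
long run_demo(int nthreads, long iters)
{
	pthread_t threads[MAX_THREADS];
	struct worker_arg args[MAX_THREADS];

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_store(&slots[i].value, 0);

	for (int i = 0; i < nthreads; i++) {
		args[i].id = (uint32_t)i + 1000;
		args[i].iters = iters;
		int err = pthread_create(&threads[i], NULL, worker, &args[i]);
		assert(err == 0);
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);

	return total_hits();
}
```

Calling `run_demo(4, 100000)` should return the sum of all increments regardless of how the workers hash to slots. The padding trades memory for less cache-line bouncing, which is the contention effect the "before" numbers in the commit message illustrate.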
accounting
arch	Core x86 changes for v6.9:	2024-03-11 19:53:15 -07:00
bootconfig
bpf	bpftool: Remove unnecessary source files from bootstrap version	2024-03-19 23:17:55 -07:00
build
certs
cgroup	samples: introduce new samples subdir for cgroup	2023-12-10 16:51:54 -08:00
counter	tools/counter: Remove unneeded semicolon	2023-12-20 11:43:31 -05:00
crypto	crypto: tcrypt - add script tcrypt_speed_compare.py	2023-12-29 11:25:55 +08:00
debugging
edid
firewire
firmware
gpio
hv
iio	iio: add modifiers for A and B ultraviolet light	2023-12-04 13:57:24 +00:00
include	bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs	2024-03-19 23:05:34 -07:00
kvm/kvm_stat
laptop
leds
lib	libbpf: add support for BPF cookie for raw_tp/tp_btf programs	2024-03-19 23:05:34 -07:00
memory-model
mm
net/ynl	netlink: specs: support generating code for genl socket priv	2024-03-11 15:15:42 -07:00
objtool	hardening updates for v6.9-rc1	2024-03-12 14:49:30 -07:00
pci
pcmcia
perf	perf evlist: Fix evlist__new_default() for > 1 core PMU	2024-01-30 11:40:28 -03:00
power	tools cpupower bench: Override CFLAGS assignments	2024-01-21 16:57:51 -07:00
rcu
scripts
spi
testing	selftests/bpf: scale benchmark counting by using per-CPU counters	2024-03-19 23:41:35 -07:00
thermal	tools/thermal/tmon: Fix compilation warning for wrong format	2024-01-02 09:33:19 +01:00
time
tracing	tools/rtla: Exit with EXIT_SUCCESS when help is invoked	2024-02-12 10:59:09 +01:00
usb
verification	tools/rv: Fix curr_reactor uninitialized variable	2024-02-12 09:58:36 +01:00
virtio	tools: virtio: introduce vhost_net_test	2024-03-05 11:38:14 +01:00
wmi
workqueue	workqueue: Implement BH workqueues to eventually replace tasklets	2024-02-04 11:28:06 -10:00
Makefile