linux/tools
Andrii Nakryiko 520fad2e32 selftests/bpf: scale benchmark counting by using per-CPU counters
When benchmarking with multiple threads (-pN, where N>1), we start
contending on single atomic counter that both BPF trigger benchmarks are
using, as well as "baseline" tests in user space (trig-base and
trig-uprobe-base benchmarks). As such, we start bottlenecking on
something completely irrelevant to benchmark at hand.

Scale counting up by using per-CPU counters on BPF side. On use space
side we do the next best thing: hash thread ID to approximate per-CPU
behavior. It seems to work quite well in practice.

To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
16, and 32 threads:
  - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
  - trig-base (get_pgid() syscall, atomic counter in user-space);
  - trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
    counter on BPF side).

Command used:

  for b in uprobe-base base fentry; do \
    for p in 1 2 4 8 16 32; do \
      printf "%-11s %2d: %s\n" $b $p \
        "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
    done; \
  done

Before these changes, aggregate throughput across all threads doesn't
scale well with number of threads, it actually even falls sharply for
uprobe-base due to a very high contention:

  uprobe-base  1:  138.998 ± 0.650M/s
  uprobe-base  2:   70.526 ± 1.147M/s
  uprobe-base  4:   63.114 ± 0.302M/s
  uprobe-base  8:   54.177 ± 0.138M/s
  uprobe-base 16:   45.439 ± 0.057M/s
  uprobe-base 32:   37.163 ± 0.242M/s
  base         1:   16.940 ± 0.182M/s
  base         2:   19.231 ± 0.105M/s
  base         4:   21.479 ± 0.038M/s
  base         8:   23.030 ± 0.037M/s
  base        16:   22.034 ± 0.004M/s
  base        32:   18.152 ± 0.013M/s
  fentry       1:   14.794 ± 0.054M/s
  fentry       2:   17.341 ± 0.055M/s
  fentry       4:   23.792 ± 0.024M/s
  fentry       8:   21.557 ± 0.047M/s
  fentry      16:   21.121 ± 0.004M/s
  fentry      32:   17.067 ± 0.023M/s

After these changes, we see almost perfect linear scaling, as expected.
The sub-linear scaling when going from 8 to 16 threads is interesting
and consistent on my test machine, but I haven't investigated what is
causing it this peculiar slowdown (across all benchmarks, could be due
to hyperthreading effects, not sure).

  uprobe-base  1:  139.980 ± 0.648M/s
  uprobe-base  2:  270.244 ± 0.379M/s
  uprobe-base  4:  532.044 ± 1.519M/s
  uprobe-base  8: 1004.571 ± 3.174M/s
  uprobe-base 16: 1720.098 ± 0.744M/s
  uprobe-base 32: 3506.659 ± 8.549M/s
  base         1:   16.869 ± 0.071M/s
  base         2:   33.007 ± 0.092M/s
  base         4:   64.670 ± 0.203M/s
  base         8:  121.969 ± 0.210M/s
  base        16:  207.832 ± 0.112M/s
  base        32:  424.227 ± 1.477M/s
  fentry       1:   14.777 ± 0.087M/s
  fentry       2:   28.575 ± 0.146M/s
  fentry       4:   56.234 ± 0.176M/s
  fentry       8:  106.095 ± 0.385M/s
  fentry      16:  181.440 ± 0.032M/s
  fentry      32:  369.131 ± 0.693M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:41:35 -07:00
..
accounting
arch Core x86 changes for v6.9: 2024-03-11 19:53:15 -07:00
bootconfig
bpf bpftool: Remove unnecessary source files from bootstrap version 2024-03-19 23:17:55 -07:00
build
certs
cgroup samples: introduce new samples subdir for cgroup 2023-12-10 16:51:54 -08:00
counter tools/counter: Remove unneeded semicolon 2023-12-20 11:43:31 -05:00
crypto crypto: tcrypt - add script tcrypt_speed_compare.py 2023-12-29 11:25:55 +08:00
debugging
edid
firewire
firmware
gpio
hv
iio iio: add modifiers for A and B ultraviolet light 2023-12-04 13:57:24 +00:00
include bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs 2024-03-19 23:05:34 -07:00
kvm/kvm_stat
laptop
leds
lib libbpf: add support for BPF cookie for raw_tp/tp_btf programs 2024-03-19 23:05:34 -07:00
memory-model
mm
net/ynl netlink: specs: support generating code for genl socket priv 2024-03-11 15:15:42 -07:00
objtool hardening updates for v6.9-rc1 2024-03-12 14:49:30 -07:00
pci
pcmcia
perf perf evlist: Fix evlist__new_default() for > 1 core PMU 2024-01-30 11:40:28 -03:00
power tools cpupower bench: Override CFLAGS assignments 2024-01-21 16:57:51 -07:00
rcu
scripts
spi
testing selftests/bpf: scale benchmark counting by using per-CPU counters 2024-03-19 23:41:35 -07:00
thermal tools/thermal/tmon: Fix compilation warning for wrong format 2024-01-02 09:33:19 +01:00
time
tracing tools/rtla: Exit with EXIT_SUCCESS when help is invoked 2024-02-12 10:59:09 +01:00
usb
verification tools/rv: Fix curr_reactor uninitialized variable 2024-02-12 09:58:36 +01:00
virtio tools: virtio: introduce vhost_net_test 2024-03-05 11:38:14 +01:00
wmi
workqueue workqueue: Implement BH workqueues to eventually replace tasklets 2024-02-04 11:28:06 -10:00
Makefile