linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-16 16:54:20 +08:00

Yonghong Song fd29a0242f

bpf: Fix deadlock with rq_lock in bpf_send_signal()

[ Upstream commit `1bc7896e9e` ]

When experimenting with bpf_send_signal() helper in our production environment (5.2 based), we experienced a deadlock in NMI mode:

    #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
    #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
    #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
    #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
    #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
    #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
    #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
    #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
    #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
    #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
    #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
    #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
    #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
    #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
    #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
    #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
    #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
    #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
    #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue reported by Commit `eac9153f2b` ("bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()") by Song Liu. The only difference is bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event. Similar to Commit `eac9153f2b`, the below almost identical reproducer used tracepoint sched/sched_switch so the issue can be easily caught.

    /* stress_test.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>       /* nanosleep(), struct timespec */
    #include <unistd.h>     /* usleep(), close() */
    #include <sys/mman.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    #define THREAD_COUNT 1000

    char *filename;

    void *worker(void *p)
    {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
            return NULL;
        while (1) {
            struct timespec ts = {0, 1000 + rand() % 2000};

            ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
            usleep(1);
            if (ptr == MAP_FAILED) {
                printf("failed to mmap\n");
                break;
            }
            munmap(ptr, 4096 * 64);
            usleep(1);
            pptr = malloc(1);
            usleep(1);
            pptr[0] = 1;
            usleep(1);
            free(pptr);
            usleep(1);
            nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
            return 0;
        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
            if (pthread_create(threads + i, NULL, worker, NULL)) {
                fprintf(stderr, "Error creating thread\n");
                return 0;
            }
        }

        for (i = 0; i < THREAD_COUNT; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

and the following commands:

1. run `stress_test /bin/ls` in one window
2. hack bcc trace.py with the following change:

        # --- a/tools/trace.py
        # +++ b/tools/trace.py
        @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
                 __data.tgid = __tgid;
                 __data.pid = __pid;
                 bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
        +        bpf_send_signal(10);
         %s
         %s
                 %s.perf_submit(%s, &__data, sizeof(__data));

3. in a different window run

        ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending the signal if irqs are disabled, to avoid deadlocks involving rq_lock. With this change, my above stress test in our production system no longer causes a deadlock.

I also implemented a scaled-down version of the reproducer in the selftest (a subsequent commit). With the latest bpf-next, it complains about the following potential deadlock.

    [ 32.832450] -> #1 (&p->pi_lock){-.-.}:
    [ 32.833100]        _raw_spin_lock_irqsave+0x44/0x80
    [ 32.833696]        task_rq_lock+0x2c/0xa0
    [ 32.834182]        task_sched_runtime+0x59/0xd0
    [ 32.834721]        thread_group_cputime+0x250/0x270
    [ 32.835304]        thread_group_cputime_adjusted+0x2e/0x70
    [ 32.835959]        do_task_stat+0x8a7/0xb80
    [ 32.836461]        proc_single_show+0x51/0xb0
    ...
    [ 32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
    [ 32.840275]        __lock_acquire+0x1358/0x1a20
    [ 32.840826]        lock_acquire+0xc7/0x1d0
    [ 32.841309]        _raw_spin_lock_irqsave+0x44/0x80
    [ 32.841916]        __lock_task_sighand+0x79/0x160
    [ 32.842465]        do_send_sig_info+0x35/0x90
    [ 32.842977]        bpf_send_signal+0xa/0x10
    [ 32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
    [ 32.844301]        trace_call_bpf+0x115/0x270
    [ 32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
    [ 32.845411]        perf_trace_sched_switch+0x10f/0x180
    [ 32.846014]        __schedule+0x45d/0x880
    [ 32.846483]        schedule+0x5f/0xd0
    ...
    [ 32.853148] Chain exists of:
    [ 32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
    [ 32.853148]
    [ 32.854451]  Possible unsafe locking scenario:
    [ 32.854451]
    [ 32.855173]        CPU0                    CPU1
    [ 32.855745]        ----                    ----
    [ 32.856278]   lock(&rq->lock);
    [ 32.856671]                                lock(&p->pi_lock);
    [ 32.857332]                                lock(&rq->lock);
    [ 32.857999]   lock(&(&sighand->siglock)->rlock);

Deadlock happens on CPU0 when it tries to acquire &sighand->siglock but it has been held by CPU1, and CPU1 tries to grab &rq->lock and cannot get it.

This is not exactly the call stack in our production environment, but the symptom is similar: both locks are acquired with spin_lock_irqsave(), and both cases involve rq_lock. The fix to delay sending the signal when irqs are disabled also fixes this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2020-04-17 10:49:57 +02:00
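
The idea behind the fix in the commit message above (delay sending the signal while irqs are disabled, instead of taking signal-related locks on the spot) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel change itself: `irqs_off`, `struct pending_work`, `model_send_signal()`, `model_run_deferred()`, `deliver_now()` and `MODEL_MAX_SIG` are hypothetical names invented for illustration, and the single `pending` slot merely stands in for whatever deferred-work mechanism the real code in bpf_trace.c uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
#define MODEL_MAX_SIG 64        /* assumed upper bound, for the model only */

static bool irqs_off;           /* stands in for "irqs are disabled" */
static int delivered_sig;       /* last signal actually delivered, 0 = none */
static int delivered_count;     /* number of deliveries so far */

/* One deferred request slot, a stand-in for a deferred-work item. */
struct pending_work {
    bool busy;                  /* true while a request waits to be run */
    int sig;                    /* signal number to deliver later */
};
static struct pending_work pending;

/* Models the real delivery path, which would take locks and wake the task. */
static void deliver_now(int sig)
{
    delivered_sig = sig;
    delivered_count++;
}

/*
 * Request a signal.
 * Returns 0 on success, -1 if a deferred request is already pending,
 * -2 if the signal number is out of the model's range.
 */
static int model_send_signal(int sig)
{
    /* Validate early so that a deferred request cannot fail later. */
    if (sig <= 0 || sig > MODEL_MAX_SIG)
        return -2;

    if (irqs_off) {
        /* Unsafe to take further locks here: record the request and return. */
        if (pending.busy)
            return -1;
        pending.busy = true;
        pending.sig = sig;
        return 0;
    }

    /* Safe context: deliver immediately. */
    deliver_now(sig);
    return 0;
}

/* Runs later, once irqs are enabled again, and performs the delivery. */
static void model_run_deferred(void)
{
    assert(!irqs_off);
    if (!pending.busy)
        return;
    pending.busy = false;
    deliver_now(pending.sig);
}
```

In this model, a call made with `irqs_off` set returns immediately without touching the delivery path, and the signal only shows up after `model_run_deferred()` runs in a safe context. A second request made while the slot is still occupied is rejected rather than queued, which keeps the sketch to a single slot; a real implementation may make a different choice.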
blktrace.c	blktrace: fix dereference after null check	2020-03-12 13:00:10 +01:00
bpf_trace.c	bpf: Fix deadlock with rq_lock in bpf_send_signal()	2020-04-17 10:49:57 +02:00
fgraph.c	ftrace: Look up the address of return_to_handler() using helpers	2019-09-18 12:24:47 +10:00
ftrace_internal.h	treewide: Rename rcu_dereference_raw_notrace() to _check()	2019-08-01 14:16:21 -07:00
ftrace.c	ftrace: fpid_next() should increase position index	2020-02-24 08:36:59 +01:00
Kconfig	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2019-09-18 12:34:53 -07:00
Makefile	tracing: Add unified dynamic event framework	2018-12-08 20:54:09 -05:00
power-traces.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
preemptirq_delay_test.c	tracing: Use trace_clock_local() for looping in preemptirq_delay_test.c	2018-10-17 15:35:33 -04:00
ring_buffer_benchmark.c	tracing: Use CONFIG_PREEMPTION	2019-07-31 19:03:35 +02:00
ring_buffer.c	tracing: Do not create directories if lockdown is in affect	2020-01-12 12:21:45 +01:00
rpm-traces.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
trace_benchmark.c	rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs()	2018-05-15 10:27:29 -07:00
trace_benchmark.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00
trace_branch.c	x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP	2019-04-03 11:02:24 +02:00
trace_clock.c	tracing: Add SPDX License format tags to tracing files	2018-08-16 19:08:06 -04:00
trace_dynevent.c	tracing: Add tracing_check_open_get_tr()	2019-10-12 20:44:07 -04:00
trace_dynevent.h	tracing/dynevent: Pass extra arguments to match operation	2019-08-31 12:19:38 -04:00
trace_entries.h	tracing: Change the function format to display function names by perf	2019-02-11 14:53:43 -05:00
trace_event_perf.c	tracing: Fix race in perf_trace_buf initialization	2019-10-21 19:38:28 -04:00
trace_events_filter_test.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00
trace_events_filter.c	tracing: Avoid memory leak in process_system_preds()	2020-01-09 10:20:00 +01:00
trace_events_hist.c	tracing: Fix number printing bug in print_synth_event()	2020-03-21 08:11:49 +01:00
trace_events_trigger.c	trigger_next should increase position index	2020-02-24 08:36:59 +01:00
trace_events.c	tracing: Fix lock inversion in trace_event_enable_tgid_record()	2020-01-09 10:19:59 +01:00
trace_export.c	tracing: Add trigger file for trace_markers tracefs/ftrace/print	2018-05-29 08:28:53 -04:00
trace_functions_graph.c	fgraph: Remove redundant ftrace_graph_notrace_addr() test	2019-07-30 21:50:03 -04:00
trace_functions.c	tracing: Update stack trace skipping for ORC unwinder	2018-01-23 15:57:00 -05:00
trace_hwlat.c	tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency	2019-10-12 20:49:33 -04:00
trace_irqsoff.c	The biggest change for this release is in the histogram code.	2019-03-11 17:01:32 -07:00
trace_kdb.c	tracing: Silence GCC 9 array bounds warning	2019-05-25 23:04:30 -04:00
trace_kprobe_selftest.c	selftest/ftrace: Move kprobe selftest function to separate compile unit	2018-07-30 18:41:04 -04:00
trace_kprobe_selftest.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00
trace_kprobe.c	tracing/uprobe: Fix to make trace_uprobe_filter alignment safe	2020-02-05 21:22:53 +00:00
trace_mmiotrace.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
trace_nop.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
trace_output.c	tracing: Be more clever when dumping hex in __print_hex()	2019-09-17 11:21:28 -04:00
trace_output.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00
trace_preemptirq.c	kprobes: Prohibit probing on hardirq tracers	2019-02-13 08:16:40 +01:00
trace_printk.c	tracing: Add locked_down checks to the open calls of files created for tracefs	2019-10-12 20:48:06 -04:00
trace_probe_tmpl.h	tracing/probe: Support user-space dereference	2019-05-25 23:04:42 -04:00
trace_probe.c	tracing/kprobes: Have uname use __get_str() in print_fmt	2020-02-11 04:35:07 -08:00
trace_probe.h	tracing/uprobe: Fix to make trace_uprobe_filter alignment safe	2020-02-05 21:22:53 +00:00
trace_sched_switch.c	tracing: Fix sched switch start/stop refcount racy updates	2020-02-11 04:35:07 -08:00
trace_sched_wakeup.c	kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail	2020-01-14 20:08:22 +01:00
trace_selftest_dynamic.c	ftrace: Mark function tracer test functions noinline/noclone	2018-01-23 15:57:29 -05:00
trace_selftest.c	function_graph: Have selftest also emulate tr->reset() as it did with tr->init()	2019-04-21 19:46:56 -04:00
trace_seq.c	tracing: Add SPDX License format tags to tracing files	2018-08-16 19:08:06 -04:00
trace_stack.c	tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined	2020-01-14 20:08:22 +01:00
trace_stat.c	tracing: Fix very unlikely race of registering two stat tracers	2020-02-24 08:36:30 +01:00
trace_stat.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00
trace_syscalls.c	syscalls: Remove start and number from syscall_get_arguments() args	2019-04-05 09:26:43 -04:00
trace_uprobe.c	tracing/uprobe: Fix to make trace_uprobe_filter alignment safe	2020-02-05 21:22:53 +00:00
trace.c	tracing: Disable trace_printk() on post poned tests	2020-03-05 16:43:43 +01:00
trace.h	ftrace: Protect ftrace_graph_hash with ftrace_sync	2020-02-11 04:35:28 -08:00
tracing_map.c	tracing: Have the histogram compare functions convert to u64 first	2020-01-09 10:20:00 +01:00
tracing_map.h	tracing: Fix SPDX format headers to use C++ style comments	2018-08-16 19:08:06 -04:00