linux/kernel/trace
Yonghong Song fd29a0242f bpf: Fix deadlock with rq_lock in bpf_send_signal()
[ Upstream commit 1bc7896e9e ]

When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153f2b ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153f2b, the below almost identical reproducer
used tracepoint point sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following command:
  1. run `stress_test /bin/ls` in one windown
  2. hack bcc trace.py with the following change:
#     --- a/tools/trace.py
#     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending signal if
irqs is disabled to avoid deadlocks involving with rq_lock.
With this change, my above stress-test in our production system
won't cause deadlock any more.

I also implemented a scale-down version of reproducer in the
selftest (a subsequent commit). With latest bpf-next,
it complains for the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the callstack in our production environment,
  but sympotom is similar and both locks are using spin_lock_irqsave()
  to acquire the lock, and both involves rq_lock. The fix to delay
  sending signal when irq is disabled also fixed this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17 10:49:57 +02:00
..
blktrace.c blktrace: fix dereference after null check 2020-03-12 13:00:10 +01:00
bpf_trace.c bpf: Fix deadlock with rq_lock in bpf_send_signal() 2020-04-17 10:49:57 +02:00
fgraph.c ftrace: Look up the address of return_to_handler() using helpers 2019-09-18 12:24:47 +10:00
ftrace_internal.h treewide: Rename rcu_dereference_raw_notrace() to _check() 2019-08-01 14:16:21 -07:00
ftrace.c ftrace: fpid_next() should increase position index 2020-02-24 08:36:59 +01:00
Kconfig Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2019-09-18 12:34:53 -07:00
Makefile tracing: Add unified dynamic event framework 2018-12-08 20:54:09 -05:00
power-traces.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
preemptirq_delay_test.c tracing: Use trace_clock_local() for looping in preemptirq_delay_test.c 2018-10-17 15:35:33 -04:00
ring_buffer_benchmark.c tracing: Use CONFIG_PREEMPTION 2019-07-31 19:03:35 +02:00
ring_buffer.c tracing: Do not create directories if lockdown is in affect 2020-01-12 12:21:45 +01:00
rpm-traces.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
trace_benchmark.c rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs() 2018-05-15 10:27:29 -07:00
trace_benchmark.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00
trace_branch.c x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP 2019-04-03 11:02:24 +02:00
trace_clock.c tracing: Add SPDX License format tags to tracing files 2018-08-16 19:08:06 -04:00
trace_dynevent.c tracing: Add tracing_check_open_get_tr() 2019-10-12 20:44:07 -04:00
trace_dynevent.h tracing/dynevent: Pass extra arguments to match operation 2019-08-31 12:19:38 -04:00
trace_entries.h tracing: Change the function format to display function names by perf 2019-02-11 14:53:43 -05:00
trace_event_perf.c tracing: Fix race in perf_trace_buf initialization 2019-10-21 19:38:28 -04:00
trace_events_filter_test.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00
trace_events_filter.c tracing: Avoid memory leak in process_system_preds() 2020-01-09 10:20:00 +01:00
trace_events_hist.c tracing: Fix number printing bug in print_synth_event() 2020-03-21 08:11:49 +01:00
trace_events_trigger.c trigger_next should increase position index 2020-02-24 08:36:59 +01:00
trace_events.c tracing: Fix lock inversion in trace_event_enable_tgid_record() 2020-01-09 10:19:59 +01:00
trace_export.c tracing: Add trigger file for trace_markers tracefs/ftrace/print 2018-05-29 08:28:53 -04:00
trace_functions_graph.c fgraph: Remove redundant ftrace_graph_notrace_addr() test 2019-07-30 21:50:03 -04:00
trace_functions.c tracing: Update stack trace skipping for ORC unwinder 2018-01-23 15:57:00 -05:00
trace_hwlat.c tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency 2019-10-12 20:49:33 -04:00
trace_irqsoff.c The biggest change for this release is in the histogram code. 2019-03-11 17:01:32 -07:00
trace_kdb.c tracing: Silence GCC 9 array bounds warning 2019-05-25 23:04:30 -04:00
trace_kprobe_selftest.c selftest/ftrace: Move kprobe selftest function to separate compile unit 2018-07-30 18:41:04 -04:00
trace_kprobe_selftest.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00
trace_kprobe.c tracing/uprobe: Fix to make trace_uprobe_filter alignment safe 2020-02-05 21:22:53 +00:00
trace_mmiotrace.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
trace_nop.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
trace_output.c tracing: Be more clever when dumping hex in __print_hex() 2019-09-17 11:21:28 -04:00
trace_output.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00
trace_preemptirq.c kprobes: Prohibit probing on hardirq tracers 2019-02-13 08:16:40 +01:00
trace_printk.c tracing: Add locked_down checks to the open calls of files created for tracefs 2019-10-12 20:48:06 -04:00
trace_probe_tmpl.h tracing/probe: Support user-space dereference 2019-05-25 23:04:42 -04:00
trace_probe.c tracing/kprobes: Have uname use __get_str() in print_fmt 2020-02-11 04:35:07 -08:00
trace_probe.h tracing/uprobe: Fix to make trace_uprobe_filter alignment safe 2020-02-05 21:22:53 +00:00
trace_sched_switch.c tracing: Fix sched switch start/stop refcount racy updates 2020-02-11 04:35:07 -08:00
trace_sched_wakeup.c kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail 2020-01-14 20:08:22 +01:00
trace_selftest_dynamic.c ftrace: Mark function tracer test functions noinline/noclone 2018-01-23 15:57:29 -05:00
trace_selftest.c function_graph: Have selftest also emulate tr->reset() as it did with tr->init() 2019-04-21 19:46:56 -04:00
trace_seq.c tracing: Add SPDX License format tags to tracing files 2018-08-16 19:08:06 -04:00
trace_stack.c tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined 2020-01-14 20:08:22 +01:00
trace_stat.c tracing: Fix very unlikely race of registering two stat tracers 2020-02-24 08:36:30 +01:00
trace_stat.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00
trace_syscalls.c syscalls: Remove start and number from syscall_get_arguments() args 2019-04-05 09:26:43 -04:00
trace_uprobe.c tracing/uprobe: Fix to make trace_uprobe_filter alignment safe 2020-02-05 21:22:53 +00:00
trace.c tracing: Disable trace_printk() on post poned tests 2020-03-05 16:43:43 +01:00
trace.h ftrace: Protect ftrace_graph_hash with ftrace_sync 2020-02-11 04:35:28 -08:00
tracing_map.c tracing: Have the histogram compare functions convert to u64 first 2020-01-09 10:20:00 +01:00
tracing_map.h tracing: Fix SPDX format headers to use C++ style comments 2018-08-16 19:08:06 -04:00