linux/kernel
Frederic Weisbecker 6b8ca75431 rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
[ Upstream commit 55d4669ef1 ]

When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.

However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there to the end, making
rcutree_migrate_callback() bypass barrier_lock.

The following theoretical race example can make rcu_barrier() hang:

CPU 0                                               CPU 1
-----                                               -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
   invoke_rcu_core()
do_softirq()
   rcu_core()
      rcu_do_batch()
         // callback storm
         // rcu_do_batch() returns
         // before completing all
         // of them
   // do_softirq also returns early because of
   // timeout. It defers to ksoftirqd but
   // it's parked
</IRQ>
stop_machine()
   take_cpu_down()
                                                    rcu_barrier()
                                                        spin_lock(barrier_lock)
                                                        // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
   rcu_core()
      rcu_do_batch()
         //completes all pending callbacks
         //smp_mb() implied _after_ callback number dec
</IRQ>

rcutree_report_cpu_dead()
   rnp->qsmaskinitnext &= ~rdp->grpmask;

rcutree_migrate_callback()
   // no callback, early return without locking
   // barrier_lock
                                                        //observes !rcu_rdp_cpu_online(rdp)
                                                        rcu_barrier_entrain()
                                                           rcu_segcblist_entrain()
                                                              // Observe rcu_segcblist_n_cbs(rsclp) == 0
                                                              // because no barrier between reading
                                                              // rnp->qsmaskinitnext and rsclp->len
                                                              rcu_segcblist_add_len()
                                                                 smp_mb__before_atomic()
                                                                 // will now observe the 0 count and empty
                                                                 // list, but too late, we enqueue regardless
                                                                 WRITE_ONCE(rsclp->len, rsclp->len + v);
                                                        // ignored barrier callback
                                                        // rcu barrier stall...

This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().

However the rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock seldom contended. Solve the issue with unconditionally
locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:52:45 +02:00
..
bpf bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func 2024-08-03 08:49:45 +02:00
cgroup cgroup/cpuset: Prevent UAF in proc_cpuset_show() 2024-08-03 08:48:54 +02:00
configs Kbuild: add Rust support 2022-09-28 09:02:20 +02:00
debug kdb: Use the passed prompt in kdb_position_cursor() 2024-08-03 08:49:47 +02:00
dma dma: fix call order in dmam_free_coherent 2024-08-03 08:49:49 +02:00
entry entry: Respect changes to system call number by trace_sys_enter() 2024-04-03 15:19:44 +02:00
events bpf, events: Use prog to emit ksymbol event for main program 2024-08-03 08:49:49 +02:00
futex futex: Don't include process MM in futex key on no-MMU 2023-11-20 11:51:50 +01:00
gcov gcov: add support for GCC 14 2024-06-27 13:46:22 +02:00
irq irqdomain: Fixed unbalanced fwnode get and put 2024-08-11 12:35:54 +02:00
kcsan kcsan: Don't expect 64 bits atomic builtins from 32 bits architectures 2023-07-19 16:21:37 +02:00
livepatch livepatch: Fix missing newline character in klp_resolve_symbols() 2023-11-20 11:52:10 +01:00
locking locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers 2024-08-03 08:49:08 +02:00
module modules: wait do_free_init correctly 2024-03-26 18:20:52 -04:00
power PM: s2idle: Make sure CPUs will wakeup directly on resume 2024-04-17 11:18:22 +02:00
printk printk: Update @console_may_schedule in console_trylock_spinning() 2024-04-03 15:19:44 +02:00
rcu rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation 2024-08-14 13:52:45 +02:00
sched sched/fair: Use all little CPUs for CPU-bound workloads 2024-08-03 08:49:33 +02:00
time tick/broadcast: Make takeover of broadcast hrtimer reliable 2024-08-03 08:49:30 +02:00
trace trace/pid_list: Change gfp flags in pid_list_fill_irq() 2024-08-03 08:49:34 +02:00
.gitignore
acct.c acct: fix potential integer overflow in encode_comp_t() 2022-12-31 13:32:58 +01:00
async.c async: Introduce async_schedule_dev_nocall() 2024-01-31 16:17:00 -08:00
audit_fsnotify.c audit: fix potential double free on error path from fsnotify_add_inode_mark 2022-08-22 18:50:06 -04:00
audit_tree.c
audit_watch.c audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare() 2023-11-28 17:07:08 +00:00
audit.c audit: Send netlink ACK before setting connection in auditd_set 2024-02-05 20:12:47 +00:00
audit.h audit: remove selinux_audit_rule_update() declaration 2022-09-07 11:30:15 -04:00
auditfilter.c ima: Avoid blocking in RCU read-side critical section 2024-07-11 12:47:16 +02:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-25 12:03:04 +02:00
backtracetest.c
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-05-02 16:29:32 +02:00
capability.c
cfi.c cfi: Switch to -fsanitize=kcfi 2022-09-26 10:13:13 -07:00
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-04-06 12:10:40 +02:00
configs.c
context_tracking.c context_tracking: Fix noinstr vs KASAN 2023-03-10 09:33:45 +01:00
cpu_pm.c context_tracking: Take IRQ eqs entrypoints over RCU 2022-07-05 13:32:59 -07:00
cpu.c cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked() 2024-07-05 09:31:56 +02:00
crash_core.c vmcoreinfo: add kallsyms_num_syms symbol 2022-08-28 14:02:44 -07:00
crash_dump.c
cred.c cred: switch to using atomic_long_t 2023-12-20 17:00:20 +01:00
delayacct.c delayacct: support re-entrance detection of thrashing accounting 2022-09-26 19:46:07 -07:00
dma.c
exec_domain.c
exit.c mm: optimize the redundant loop of mm_update_owner_next() 2024-07-11 12:47:13 +02:00
extable.c context_tracking: Take NMI eqs entrypoints over RCU 2022-07-05 13:32:59 -07:00
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-03-11 13:55:39 +01:00
fork.c Revert "fork: defer linking file vma until vma is fully initialized" 2024-06-21 14:35:59 +02:00
freezer.c freezer,sched: Rewrite core freezer logic 2022-09-07 21:53:50 +02:00
gen_kheaders.sh kheaders: explicitly define file modes for archived headers 2024-06-27 13:46:24 +02:00
groups.c security: Add LSM hook to setgroups() syscall 2022-07-15 18:21:49 +00:00
hung_task.c sched: Fix more TASK_state comparisons 2022-09-30 16:50:39 +02:00
iomem.c
irq_work.c
jump_label.c jump_label: Fix the fix, brown paper bags galore 2024-08-14 13:52:43 +02:00
kallsyms_internal.h kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[] 2023-10-25 12:03:16 +02:00
kallsyms.c kallsyms: Add helper kallsyms_on_each_match_symbol() 2023-10-25 12:03:16 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: don't lose track of remote references during softirqs 2024-06-27 13:46:22 +02:00
kexec_core.c kexec: fix a memory leak in crash_shrink_memory() 2023-07-19 16:21:08 +02:00
kexec_elf.c
kexec_file.c kexec: support purgatories with .text.hot sections 2023-06-21 16:00:55 +02:00
kexec_internal.h panic, kexec: make __crash_kexec() NMI safe 2022-09-11 21:55:06 -07:00
kexec.c kernel: kexec: copy user-array safely 2023-11-28 17:06:57 +00:00
kheaders.c kheaders: Use array declaration instead of char 2023-05-11 23:03:02 +09:00
kmod.c
kprobes.c kprobes: Fix possible use-after-free issue on kprobe registration 2024-04-17 11:18:27 +02:00
ksysfs.c kexec: turn all kexec_mutex acquisitions into trylocks 2022-09-11 21:55:06 -07:00
kthread.c signal: break out of wait loops on kthread_stop() 2022-10-09 16:01:59 -07:00
latencytop.c latencytop: use the last element of latency_record of system 2022-09-11 21:55:12 -07:00
Makefile kernel/numa.c: Move logging out of numa.h 2024-06-12 11:03:16 +02:00
module_signature.c
notifier.c notifier: Add blocking/atomic_notifier_chain_register_unique_prio() 2022-05-19 19:30:30 +02:00
nsproxy.c Revert "fs/exec: allow to unshare a time namespace on vfork+exec" 2022-09-13 10:38:43 -07:00
numa.c kernel/numa.c: Move logging out of numa.h 2024-06-12 11:03:16 +02:00
padata.c padata: Disable BH when taking works lock on MT path 2024-06-27 13:46:14 +02:00
panic.c panic: Flush kernel log buffer at the end 2024-04-13 13:04:54 +02:00
params.c
pid_namespace.c zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-21 14:36:01 +02:00
pid.c gfs2: Add glockfd debugfs file 2022-06-29 13:07:16 +02:00
profile.c kernel/profile.c: simplify duplicated code in profile_setup() 2022-09-11 21:55:12 -07:00
ptrace.c freezer,sched: Rewrite core freezer logic 2022-09-07 21:53:50 +02:00
range.c
reboot.c kernel/reboot: emergency_restart: Set correct system_state 2023-11-28 17:07:13 +00:00
regset.c
relay.c relayfs: fix out-of-bounds access in relay_file_read 2023-05-11 23:03:03 +09:00
resource_kunit.c
resource.c PCI: Allow drivers to request exclusive config regions 2023-09-13 09:42:46 +02:00
rseq.c rseq: Use pr_warn_once() when deprecated/unknown ABI flags are encountered 2022-11-14 09:58:32 +01:00
scftorture.c scftorture: Forgive memory-allocation failure if KASAN 2023-09-23 11:11:00 +02:00
scs.c
seccomp.c
signal.c kernel: rerun task_work while freezing in get_signal() 2024-08-03 08:49:31 +02:00
smp.c smp,csd: Throw an error if a CSD lock is stuck for too long 2023-11-28 17:06:55 +00:00
smpboot.c smpboot: use atomic_try_cmpxchg in cpu_wait_death and cpu_report_death 2022-09-11 21:55:10 -07:00
smpboot.h
softirq.c softirq: Fix suspicious RCU usage in __do_softirq() 2024-06-12 11:03:01 +02:00
stackleak.c
stacktrace.c
static_call_inline.c
static_call.c
stop_machine.c Scheduler changes in this cycle were: 2022-05-24 11:11:13 -07:00
sys_ni.c syscalls: fix compat_sys_io_pgetevents_time64 usage 2024-07-05 09:31:59 +02:00
sys.c getrusage: use sig->stats_lock rather than lock_task_sighand() 2024-03-15 10:48:22 -04:00
sysctl-test.c kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred} 2022-09-08 16:56:45 -07:00
sysctl.c proc: proc_skip_spaces() shouldn't think it is working on C strings 2022-12-05 12:09:06 -08:00
task_work.c task_work: Introduce task_work_cancel() again 2024-08-03 08:49:34 +02:00
taskstats.c genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
torture.c torture: Fix hang during kthread shutdown phase 2023-03-10 09:34:07 +01:00
tracepoint.c tracepoint: Optimize the critical region of mutex_lock in tracepoint_module_coming() 2022-09-26 13:01:18 -04:00
tsacct.c
ucount.c ucounts: Split rlimit and ucount values and max values 2022-05-18 18:24:57 -05:00
uid16.c
uid16.h
umh.c freezer,umh: Fix call_usermode_helper_exec() vs SIGKILL 2023-02-22 12:59:50 +01:00
up.c
user_namespace.c ucounts: Split rlimit and ucount values and max values 2022-10-09 16:24:05 -07:00
user-return-notifier.c
user.c
usermode_driver.c blob_to_mnt(): kern_unmount() is needed to undo kern_mount() 2022-05-19 23:25:47 -04:00
utsname_sysctl.c kernel/utsname_sysctl.c: Fix hostname polling 2022-10-23 12:01:01 -07:00
utsname.c
watch_queue.c kernel: watch_queue: copy user-array safely 2023-11-28 17:06:57 +00:00
watchdog_hld.c watchdog/perf: properly initialize the turbo mode timestamp and rearm counter 2024-08-03 08:49:42 +02:00
watchdog.c watchdog: move softlockup_panic back to early_param 2023-11-28 17:07:09 +00:00
workqueue_internal.h
workqueue.c workqueue: Provide one lock class key per work_on_cpu() callsite 2023-11-28 17:06:55 +00:00