linux/kernel
Mathias Krause b027789e5e sched/fair: Prevent dead task groups from regaining cfs_rq's
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                           :
 (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                    :
 (2)  list_del_rcu(&tg->siblings);                         :
      :                                                    tg_unthrottle_up():
      unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                    :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
        :                                                    :
        :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
 (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
      :                                                      :
      :                                                    :
      :                                                  :
      :                                                :
      :                                              :
 (4)  :                                              rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

  [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
  [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-11-11 13:09:33 +01:00
..
bpf Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2021-11-02 15:37:27 -07:00
cgroup Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
configs drivers/char: remove /dev/kmem for good 2021-05-07 00:26:34 -07:00
debug kdb: Adopt scheduler's task classification 2021-11-03 17:21:37 +00:00
dma dma-mapping updates for Linux 5.16 2021-11-09 10:56:41 -08:00
entry Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-11-10 16:15:54 -08:00
events Core: 2021-11-02 06:20:58 -07:00
futex futex: Fix PREEMPT_RT build 2021-10-19 17:27:05 +02:00
gcov Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR 2021-06-22 11:07:18 -07:00
irq pci-v5.16-changes 2021-11-06 14:36:12 -07:00
kcsan LKMM updates: 2021-09-02 13:00:15 -07:00
livepatch Tracing updates for 5.16: 2021-11-01 20:05:19 -07:00
locking Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
power Power management updates for 5.16-rc1 2021-11-02 16:04:28 -07:00
printk Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
rcu RCU pull request for v5.16 2021-11-01 20:25:38 -07:00
sched sched/fair: Prevent dead task groups from regaining cfs_rq's 2021-11-11 13:09:33 +01:00
time posix-cpu-timers: Prevent spuriously armed 0-value itimer 2021-09-23 11:53:51 +02:00
trace Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
.gitignore .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
acct.c kernel: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
async.c kernel/async.c: remove async_unregister_domain() 2021-05-07 00:26:33 -07:00
audit_fsnotify.c fsnotify: clarify contract for create event hooks 2021-10-27 12:32:34 +02:00
audit_tree.c audit/stable-5.16 PR 20211101 2021-11-01 21:17:39 -07:00
audit_watch.c \n 2021-11-06 16:43:20 -07:00
audit.c lsm: separate security_task_getsecid() into subjective and objective variants 2021-03-22 15:23:32 -04:00
audit.h audit/stable-5.16 PR 20211101 2021-11-01 21:17:39 -07:00
auditfilter.c audit: add filtering for io_uring records 2021-09-19 22:34:38 -04:00
auditsc.c audit/stable-5.16 PR 20211101 2021-11-01 21:17:39 -07:00
backtracetest.c
bounds.c
capability.c
cfi.c cfi: Use rcu_read_{un}lock_sched_notrace 2021-08-11 13:11:12 -07:00
compat.c arch: remove compat_alloc_user_space 2021-09-08 15:32:35 -07:00
configs.c
context_tracking.c
cpu_pm.c PM: cpu: Make notifier chain use a raw_spinlock_t 2021-08-16 18:55:32 +02:00
cpu.c cpu/hotplug: Add debug printks for hotplug callback failures 2021-08-10 18:31:32 +02:00
crash_core.c kdump: use vmlinux_build_id to simplify 2021-07-08 11:48:22 -07:00
crash_dump.c
cred.c ucounts: In set_cred_ucounts assume new->ucounts is non-NULL 2021-10-20 10:45:34 -05:00
delayacct.c delayacct: Add sysctl to enable at runtime 2021-05-12 11:43:25 +02:00
dma.c
exec_domain.c
exit.c Merge branch 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-11-03 12:15:29 -07:00
extable.c extable: use is_kernel_text() helper 2021-11-09 10:02:51 -08:00
fail_function.c
fork.c Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
freezer.c sched: Add get_current_state() 2021-06-18 11:43:08 +02:00
gen_kheaders.sh kbuild: clean up ${quiet} checks in shell scripts 2021-05-27 04:01:50 +09:00
groups.c groups: simplify struct group_info allocation 2021-02-26 09:41:03 -08:00
hung_task.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
iomem.c
irq_work.c irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT 2021-10-15 11:25:18 +02:00
jump_label.c jump_label: Fix jump_label_text_reserved() vs __init 2021-07-05 10:46:20 +02:00
kallsyms.c kallsyms: strip LTO suffixes from static functions 2021-10-04 10:58:25 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/rwlock: Provide RT variant 2021-08-17 17:50:51 +02:00
Kconfig.preempt sched: Provide Kconfig support for default dynamic preempt mode 2021-10-05 15:51:56 +02:00
kcov.c kcov: replace local_irq_save() with a local_lock_t 2021-11-09 10:02:52 -08:00
kexec_core.c Merge branch 'rework/printk_safe-removal' into for-linus 2021-08-30 16:36:10 +02:00
kexec_elf.c
kexec_file.c memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED 2021-11-06 13:30:42 -07:00
kexec_internal.h kexec: move machine_kexec_post_load() to public interface 2021-02-22 12:33:26 +00:00
kexec.c kexec: avoid compat_alloc_user_space 2021-09-08 15:32:34 -07:00
kheaders.c
kmod.c modules: add CONFIG_MODPROBE_PATH 2021-05-07 00:26:33 -07:00
kprobes.c Tracing updates for 5.16: 2021-11-01 20:05:19 -07:00
ksysfs.c
kthread.c Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-11-10 16:15:54 -08:00
latencytop.c
Makefile Tracing updates for 5.16: 2021-11-01 20:05:19 -07:00
module_signature.c
module_signing.c
module-internal.h
module.c module: change to print useful messages from elf_validity_check() 2021-11-05 15:13:10 -07:00
notifier.c notifier: Remove atomic_notifier_call_chain_robust() 2021-08-16 18:55:32 +02:00
nsproxy.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
padata.c padata: Remove repeated verbose license text 2021-08-27 16:30:18 +08:00
panic.c Merge branch 'rework/printk_safe-removal' into for-linus 2021-08-30 16:36:10 +02:00
params.c params: lift param_set_uint_minmax to common code 2021-08-16 14:42:22 +02:00
pid_namespace.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
pid.c pid: add pidfd_get_task() helper 2021-10-14 13:29:18 +02:00
profile.c profiling: fix shift-out-of-bounds bugs 2021-09-08 11:50:26 -07:00
ptrace.c sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
range.c
reboot.c reboot: Remove the unreachable panic after do_exit in reboot(2) 2021-10-20 13:09:51 -05:00
regset.c
relay.c
resource_kunit.c
resource.c kernel/resource: disallow access to exclusive system RAM regions 2021-11-09 10:02:52 -08:00
rseq.c KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest 2021-09-22 10:24:01 -04:00
scftorture.c scftorture: Warn on individual scf_torture_init() error conditions 2021-09-16 10:27:48 -07:00
scs.c scs: Release kasan vmalloc poison in scs_free process 2021-09-30 09:37:27 +01:00
seccomp.c Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-09-01 14:52:05 -07:00
signal.c Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-11-10 16:15:54 -08:00
smp.c sched: Improve wake_up_all_idle_cpus() take #2 2021-10-22 15:32:46 +02:00
smpboot.c smpboot: Replace deprecated CPU-hotplug functions. 2021-08-10 14:57:42 +02:00
smpboot.h
softirq.c genirq: Change force_irqthreads to a static key 2021-08-10 22:50:07 +02:00
stackleak.c
stacktrace.c stacktrace: move filter_irq_stacks() to kernel/stacktrace.c 2021-11-06 13:30:43 -07:00
static_call.c static_call: Fix static_call_text_reserved() vs __init 2021-07-05 10:46:33 +02:00
stop_machine.c stop_machine: Add caller debug info to queue_stop_cpus_work 2021-03-23 16:01:58 +01:00
sys_ni.c futex: Implement sys_futex_waitv() 2021-10-07 13:51:11 +02:00
sys.c Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
sysctl-test.c kernel/sysctl-test: Remove some casts which are no-longer required 2021-06-23 16:41:24 -06:00
sysctl.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
task_work.c kasan: record task_work_add() call stack 2021-04-30 11:20:42 -07:00
taskstats.c
torture.c torture: Replace deprecated CPU-hotplug functions. 2021-08-10 10:48:07 -07:00
tracepoint.c tracepoint: Fix kerneldoc comments 2021-08-16 11:39:51 -04:00
tsacct.c mm/mmap.c: fix a data race of mm->total_vm 2021-11-06 13:30:35 -07:00
ucount.c ucounts: Use atomic_long_sub_return for clarity 2021-10-20 10:45:34 -05:00
uid16.c
uid16.h
umh.c kernel/umh.c: fix some spelling mistakes 2021-05-07 00:26:34 -07:00
up.c A set of locking related fixes and updates: 2021-05-09 13:07:03 -07:00
user_namespace.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
user-return-notifier.c
user.c fs/epoll: use a per-cpu counter for user's watches count 2021-09-08 11:50:27 -07:00
usermode_driver.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
utsname_sysctl.c
utsname.c
watch_queue.c
watchdog_hld.c
watchdog.c kernel: watchdog: modify the explanation related to watchdog thread 2021-06-29 10:53:46 -07:00
workqueue_internal.h workqueue: Assign a color to barrier work items 2021-08-17 07:49:10 -10:00
workqueue.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00