linux/kernel
Linus Torvalds c40f7d74c7 sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b
Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.

Do a (manual) revert of:

  a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
is a surprising property that was not considered in followup commits
such as:

  9c2791f936 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-30 13:54:31 +01:00
..
bpf bpf: verifier: make sure callees don't prune with caller differences 2018-12-13 10:35:40 -08:00
cgroup cgroups: Replace synchronize_sched() with synchronize_rcu() 2018-12-01 12:38:49 -08:00
configs kvm_config: add CONFIG_VIRTIO_MENU 2018-10-24 20:55:56 -04:00
debug kdb: kdb_support: mark expected switch fall-throughs 2018-11-13 20:38:50 +00:00
dma dma-direct: do not include SME mask in the DMA supported check 2018-12-17 18:02:11 +01:00
events Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:45:18 -08:00
gcov gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT 2018-06-08 18:56:02 +09:00
irq genirq/affinity: Add is_managed to struct irq_affinity_desc 2018-12-19 11:32:08 +01:00
livepatch livepatch: Replace synchronize_sched() with synchronize_rcu() 2018-12-01 12:38:50 -08:00
locking Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:25:52 -08:00
power Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:56:10 -08:00
printk mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
rcu rcutorture: Don't do busted forward-progress testing 2018-12-01 12:45:42 -08:00
sched sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b 2018-12-30 13:54:31 +01:00
time Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-25 15:44:08 -08:00
trace Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 13:07:19 -08:00
.gitignore
acct.c
async.c kernel/async.c: revert "async: simplify lowest_in_progress()" 2018-02-06 18:32:44 -08:00
audit_fsnotify.c fsnotify: add fsnotify_add_inode_mark() wrappers 2018-05-18 14:58:22 +02:00
audit_tree.c \n 2018-08-17 09:41:28 -07:00
audit_watch.c audit: fix use-after-free in audit_add_watch 2018-07-18 11:43:36 -04:00
audit.c audit: use ktime_get_coarse_real_ts64() for timestamps 2018-07-17 14:45:08 -04:00
audit.h audit: track the owner of the command mutex ourselves 2018-02-23 11:22:22 -05:00
auditfilter.c audit: rename FILTER_TYPE to FILTER_EXCLUDE 2018-06-19 10:39:54 -04:00
auditsc.c audit/stable-4.18 PR 20180814 2018-08-15 10:46:54 -07:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-10-31 08:54:14 -07:00
capability.c
compat.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
configs.c
context_tracking.c
cpu_pm.c
cpu.c x86/speculation: Rework SMT state change 2018-11-28 11:57:07 +01:00
crash_core.c kernel/crash_core.c: print timestamp using time64_t 2018-08-22 10:52:47 -07:00
crash_dump.c
cred.c
delayacct.c delayacct: track delays from thrashing cache pages 2018-10-26 16:26:32 -07:00
dma.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
elfcore.c
exec_domain.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
exit.c signal: Pass pid type into group_send_sig_info 2018-07-21 12:57:35 -05:00
extable.c extable: Make init_kernel_text() global 2018-02-21 16:54:06 +01:00
fail_function.c kernel/fail_function.c: remove meaningless null pointer check before debugfs_remove_recursive 2018-10-31 08:54:12 -07:00
fork.c fork,memcg: fix crash in free_thread_stack on memcg charge fail 2018-12-21 14:51:18 -08:00
freezer.c PM / reboot: Eliminate race between reboot and suspend 2018-08-06 12:35:20 +02:00
futex_compat.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
futex.c futex: Cure exit race 2018-12-18 23:13:15 +01:00
groups.c
hung_task.c kernel: hung_task.c: disable on suspend 2018-10-25 18:45:08 +02:00
iomem.c memremap: split devm_memremap_pages() and memremap() infrastructure 2018-05-15 23:08:33 -07:00
irq_work.c
jump_label.c Merge branch 'x86/build' into locking/core, to pick up dependent patches and unify jump-label work 2018-10-16 17:30:11 +02:00
kallsyms.c kallsyms: reduce size a little on 64-bit 2018-09-10 22:54:33 +09:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt kconfig: include kernel/Kconfig.preempt from init/Kconfig 2018-08-02 08:06:54 +09:00
kcov.c kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace 2018-11-30 14:56:14 -08:00
kexec_core.c kexec: Allocate decrypted control pages for kdump if SME is enabled 2018-10-06 12:01:51 +02:00
kexec_file.c kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump 2018-12-06 14:38:50 +00:00
kexec_internal.h
kexec.c kexec: add call to LSM hook in original kexec_load syscall 2018-07-16 12:31:57 -07:00
kmod.c
kprobes.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:45:18 -08:00
ksysfs.c
kthread.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
latencytop.c
Makefile x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls 2018-09-04 10:35:47 -07:00
memremap.c Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
module_signing.c modsign: log module name in the event of an error 2018-07-02 11:36:17 +02:00
module-internal.h modsign: log module name in the event of an error 2018-07-02 11:36:17 +02:00
module.c modules: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:43 -08:00
notifier.c
nsproxy.c
padata.c
panic.c kernel/panic.c: filter out a potential trailing newline 2018-10-31 08:54:14 -07:00
params.c kernel/params.c: downgrade warning for unsafe parameters 2018-04-11 10:28:37 -07:00
pid_namespace.c signal: Use group_send_sig_info to kill all processes in a pid namespace 2018-09-16 16:08:25 +02:00
pid.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
profile.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ptrace.c ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS 2018-11-28 11:57:11 +01:00
range.c
reboot.c kernel/reboot.c: export pm_power_off_prepare 2018-09-11 16:13:24 +01:00
relay.c kernel/relay.c: change return type to vm_fault_t 2018-06-15 07:55:24 +09:00
resource.c resource/docs: Complete kernel-doc style function documentation 2018-11-07 16:47:47 +01:00
rseq.c rseq: uapi: Declare rseq_cs field as union, update includes 2018-07-10 22:18:52 +02:00
seccomp.c Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2018-10-24 11:49:35 +01:00
signal.c kernel/signal.c: fix a comment error 2018-10-31 08:54:14 -07:00
smp.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
smpboot.c smpboot: Remove cpumask from the API 2018-07-03 09:20:44 +02:00
smpboot.h
softirq.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-25 11:43:47 -07:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
sys_ni.c Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-06-10 10:17:09 -07:00
sys.c arm64: add prctl control for resetting ptrauth keys 2018-12-13 16:42:46 +00:00
sysctl_binary.c staging: irda: remove remaining remants of irda code removal 2018-04-16 11:26:49 +02:00
sysctl.c kernel/sysctl.c: remove duplicated include 2018-11-03 10:09:37 -07:00
task_work.c
taskstats.c pids: introduce find_get_task_by_vpid() helper 2018-02-06 18:32:46 -08:00
test_kprobes.c kprobes: Remove jprobe API implementation 2018-06-21 12:33:05 +02:00
torture.c torture: Remove unnecessary "ret" variables 2018-12-01 12:45:35 -08:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
uid16.c fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers 2018-04-02 20:15:59 +02:00
uid16.h kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c 2018-04-02 20:15:30 +02:00
umh.c umh: Add command line to user mode helpers 2018-10-22 19:37:36 -07:00
up.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
user_namespace.c userns: also map extents in the reverse map to kernel IDs 2018-11-07 23:51:16 -06:00
user-return-notifier.c
user.c userns: use irqsave variant of refcount_dec_and_lock() 2018-08-22 10:52:47 -07:00
utsname_sysctl.c sys: don't hold uts_sem while accessing userspace memory 2018-08-11 02:05:53 -05:00
utsname.c uts: create "struct uts_namespace" from kmem_cache 2018-04-11 10:28:35 -07:00
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
watchdog.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
workqueue_internal.h workqueue: Set worker->desc to workqueue name by default 2018-05-18 08:47:13 -07:00
workqueue.c workqueue: Replace call_rcu_sched() with call_rcu() 2018-11-27 09:21:44 -08:00