2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-27 08:05:27 +08:00
linux-next/kernel/sched
Mel Gorman 23e6082a52 sched: Limit the amount of NUMA imbalance that can exist at fork time
At fork time currently, a local node can be allowed to fill completely
and allow the periodic load balancer to fix the problem. This can be
problematic in cases where a task creates lots of threads that idle until
woken as part of a worker poll causing a memory bandwidth problem.

However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.

This patch limits how much a local node can fill before spilling over
to another node and it will not be a universal win. Specifically,
very short-lived workloads that fit within a NUMA node would prefer
the memory bandwidth.

As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.

pft timings
                                 5.10.0-rc2             5.10.0-rc2
                          imbalancefloat-v2          forkspread-v2
Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*

This is showing the time to fault regions belonging to threads. The target
machine has 80 logical CPUs and two nodes. Note the ~30% gain when the
machine is approximately the point where one node becomes fully utilised.
The slower results are borderline noise.

Kernel building shows similar benefits around the same balance point.
Generally performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point where fork stops
spreading a task so some workloads may benefit from different balance
points but it would be a risky tuning parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
2020-11-24 16:47:48 +01:00
..
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
clock.c sched/clock: Use static_branch_likely() with sched_clock_running 2019-11-29 08:10:54 +01:00
completion.c completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() 2020-03-23 18:40:25 +01:00
core.c sched: Make migrate_disable/enable() independent of RT 2020-11-24 11:25:44 +01:00
cpuacct.c sched/cpuacct: Fix charge cpuacct.usage_sys 2020-05-19 20:34:14 +02:00
cpudeadline.c sched,rt: Use the full cpumask for balancing 2020-11-10 18:39:00 +01:00
cpudeadline.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cpufreq_schedutil.c sched/topology,schedutil: Wrap sched domains rebuild 2020-11-19 11:25:47 +01:00
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-12 17:59:43 +01:00
cpupri.c Merge branch 'sched/migrate-disable' 2020-11-10 18:39:04 +01:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c sched/cputime: Improve cputime_adjust() 2020-06-15 14:10:00 +02:00
deadline.c sched: Remove select_task_rq()'s sd_flag parameter 2020-11-10 18:39:06 +01:00
debug.c sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL 2020-09-09 10:09:03 +02:00
fair.c sched: Limit the amount of NUMA imbalance that can exist at fork time 2020-11-24 16:47:48 +01:00
features.h sched/rt: Disable RT_RUNTIME_SHARE by default 2020-09-25 14:23:24 +02:00
idle.c sched: Remove select_task_rq()'s sd_flag parameter 2020-11-10 18:39:06 +01:00
isolation.c isolcpus: Affine unbound kernel threads to housekeeping cpus 2020-06-15 14:10:03 +02:00
loadavg.c sched: nohz: stop passing around unused "ticks" parameter. 2020-07-22 10:22:04 +02:00
Makefile kcsan: Improve various small stylistic details 2019-11-20 10:47:23 +01:00
membarrier.c sched: membarrier: document memory ordering scenarios 2020-10-29 11:00:31 +01:00
pelt.c sched: Add a tracepoint to track rq->nr_running 2020-07-08 11:39:02 +02:00
pelt.h sched/pelt: Cleanup PELT divider 2020-06-15 14:10:06 +02:00
psi.c sched,psi: Convert to sched_set_fifo_low() 2020-06-15 14:10:25 +02:00
rt.c sched: Remove select_task_rq()'s sd_flag parameter 2020-11-10 18:39:06 +01:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched: Make migrate_disable/enable() independent of RT 2020-11-24 11:25:44 +01:00
smp.h sched/headers: Split out open-coded prototypes into kernel/sched/smp.h 2020-05-28 11:03:20 +02:00
stats.c proc: introduce proc_create_seq{,_data} 2018-05-16 07:23:35 +02:00
stats.h psi: Move PF_MEMSTALL out of task->flags 2020-03-20 13:06:19 +01:00
stop_task.c sched: Remove select_task_rq()'s sd_flag parameter 2020-11-10 18:39:06 +01:00
swait.c sched/swait: Prepare usage in completions 2020-03-21 16:00:23 +01:00
topology.c sched/topology: Condition EAS enablement on FIE support 2020-11-19 11:25:47 +01:00
wait_bit.c sched/wait: fix ___wait_var_event(exclusive) 2019-12-17 13:32:50 +01:00
wait.c list: add "list_del_init_careful()" to go with "list_empty_careful()" 2020-08-02 20:39:44 -07:00