linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-24 22:55:35 +08:00

History

Mel Gorman 52262ee567 sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression The following XFS commit: `8ab39f11d9` ("xfs: prevent CIL push holdoff in log recovery") changed the logic from using bound workqueues to using unbound workqueues. Functionally this makes sense but it was observed at the time that the dbench performance dropped quite a lot and CPU migrations were increased. The current pattern of the task migration is straight-forward. With XFS, an IO issuer delegates work to xlog_cil_push_work ()on an unbound kworker. This runs on a nearby CPU and on completion, dbench wakes up on its old CPU as it is still idle and no migration occurs. dbench then queues the real IO on the blk_mq_requeue_work() work item which runs on a bound kworker which is forced to run on the same CPU as dbench. When IO completes, the bound kworker wakes dbench but as the kworker is a bound but, real task, the CPU is not considered idle and dbench gets migrated by select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs for a while but ultimately it starts a round-robin of all CPUs sharing the same LLC. High-frequency migration on each IO completion has poor performance overall. It has negative implications both in commication costs and power management. mpstat confirmed that at low thread counts that all CPUs sharing an LLC has low level of activity. Note that even if the CIL patch was reverted, there still would be migrations but the impact is less noticeable. It turns out that individually the scheduler, XFS, blk-mq and workqueues all made sensible decisions but in combination, the overall effect was sub-optimal. This patch special cases the IO issue/completion pattern and allows a bound kworker waker and a task wakee to stack on the same CPU if there is a strong chance they are directly related. The expectation is that the kworker is likely going back to sleep shortly. This is not guaranteed as the IO could be queued asynchronously but there is a very strong relationship between the task and kworker in this case that would justify stacking on the same CPU instead of migrating. There should be few concerns about kworker starvation given that the special casing is only when the kworker is the waker. DBench on XFS MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem UMA machine with 8 cores sharing LLC 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack Amean 1 22.63 ( 0.00%) 20.54 * 9.23%* Amean 2 25.56 ( 0.00%) 23.40 * 8.44%* Amean 4 28.63 ( 0.00%) 27.85 * 2.70%* Amean 8 37.66 ( 0.00%) 37.68 ( -0.05%) Amean 64 469.47 ( 0.00%) 468.26 ( 0.26%) Stddev 1 1.00 ( 0.00%) 0.72 ( 28.12%) Stddev 2 1.62 ( 0.00%) 1.97 ( -21.54%) Stddev 4 2.53 ( 0.00%) 3.58 ( -41.19%) Stddev 8 5.30 ( 0.00%) 5.20 ( 1.92%) Stddev 64 86.36 ( 0.00%) 94.53 ( -9.46%) NUMA machine, 48 CPUs total, 24 CPUs share cache 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 Amean 1 58.69 ( 0.00%) 30.21 * 48.53%* Amean 2 60.90 ( 0.00%) 35.29 * 42.05%* Amean 4 66.77 ( 0.00%) 46.55 * 30.28%* Amean 8 81.41 ( 0.00%) 68.46 * 15.91%* Amean 16 113.29 ( 0.00%) 107.79 * 4.85%* Amean 32 199.10 ( 0.00%) 198.22 * 0.44%* Amean 64 478.99 ( 0.00%) 477.06 * 0.40%* Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%* Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%) Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%) Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%) Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%) Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%) Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%) Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%) Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%) Dbench has been modified to report the time to complete a single "load file". This is a more meaningful metric for dbench that a throughput metric as the benchmark makes many different system calls that are not throughput-related Patch shows a 9.23% and 48.53% reduction in the time to process a load file with the difference partially explained by the number of CPUs sharing a LLC. In a separate run, task migrations were almost eliminated by the patch for low client counts. In case people have issue with the metric used for the benchmark, this is a comparison of the throughputs as reported by dbench on the NUMA machine. dbench4 Throughput (misleading but traditional) 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 Hmean 1 321.41 ( 0.00%) 617.82 * 92.22%* Hmean 2 622.87 ( 0.00%) 1066.80 * 71.27%* Hmean 4 1134.56 ( 0.00%) 1623.74 * 43.12%* Hmean 8 1869.96 ( 0.00%) 2212.67 * 18.33%* Hmean 16 2673.11 ( 0.00%) 2806.13 * 4.98%* Hmean 32 3032.74 ( 0.00%) 3039.54 ( 0.22%) Hmean 64 2514.25 ( 0.00%) 2498.96 * -0.61%* Hmean 128 1778.49 ( 0.00%) 1746.05 * -1.82%* Note that this is somewhat specific to XFS and ext4 shows no performance difference as it does not rely on kworkers in the same way. No major problem was observed running other workloads on different machines although not all tests have completed yet. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>		2020-02-10 11:24:37 +01:00
..
autogroup.c	sched/autogroup: Make autogroup_path() always available	2019-06-24 19:23:40 +02:00
autogroup.h	sched/headers: Simplify and clean up header usage in the scheduler	2018-03-04 12:39:29 +01:00
clock.c	sched/clock: Use static_branch_likely() with sched_clock_running	2019-11-29 08:10:54 +01:00
completion.c	sched/Documentation: Update wake_up() & co. memory-barrier guarantees	2018-07-17 09:30:34 +02:00
core.c	sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression	2020-02-10 11:24:37 +01:00
cpuacct.c	sched/headers: Simplify and clean up header usage in the scheduler	2018-03-04 12:39:29 +01:00
cpudeadline.c	Linux 5.2-rc5	2019-06-17 12:12:27 +02:00
cpudeadline.h	sched/headers: Simplify and clean up header usage in the scheduler	2018-03-04 12:39:29 +01:00
cpufreq_schedutil.c	sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()	2019-12-25 10:42:08 +01:00
cpufreq.c	cpufreq: Avoid leaving stale IRQ work items during CPU offline	2019-12-12 17:59:43 +01:00
cpupri.c	sched/rt: Make RT capacity-aware	2019-12-25 10:42:10 +01:00
cpupri.h	sched/rt: Make RT capacity-aware	2019-12-25 10:42:10 +01:00
cputime.c	sched/cputime: move rq parameter in irqtime_account_process_tick	2020-01-17 10:19:21 +01:00
deadline.c	sched/core: Further clarify sched_class::set_next_task()	2019-11-11 08:35:21 +01:00
debug.c	sched/debug: Reset watchdog on all CPUs while processing sysrq-t	2020-01-17 10:19:20 +01:00
fair.c	sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression	2020-02-10 11:24:37 +01:00
features.h	sched/fair/util_est: Implement faster ramp-up EWMA on utilization increases	2019-10-29 10:01:07 +01:00
idle.c	idle: fix spelling mistake "iterrupts" -> "interrupts"	2020-01-17 10:19:22 +01:00
isolation.c	sched/isolation: Prefer housekeeping CPU in local node	2019-07-25 15:51:55 +02:00
loadavg.c	timers/nohz: Update NOHZ load in remote tick	2020-01-28 21:36:44 +01:00
Makefile	psi: pressure stall information for CPU, memory, and IO	2018-10-26 16:26:32 -07:00
membarrier.c	membarrier: Fix RCU locking bug caused by faulty merge	2019-10-01 21:27:50 +02:00
pelt.c	schied/fair: Skip calculating @contrib without load	2019-12-17 13:32:51 +01:00
pelt.h	sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()	2019-06-24 19:23:39 +02:00
psi.c	sched/psi: create /proc/pressure and /proc/pressure/{io\|memory\|cpu} only when psi enabled	2020-01-17 10:19:22 +01:00
rt.c	sched/rt: Make RT capacity-aware	2019-12-25 10:42:10 +01:00
sched-pelt.h	sched/fair: Fix "runnable_avg_yN_inv" not used warnings	2019-06-17 12:15:58 +02:00
sched.h	sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression	2020-02-10 11:24:37 +01:00
stats.c	proc: introduce proc_create_seq{,_data}	2018-05-16 07:23:35 +02:00
stats.h	sched/stats: Fix unlikely() use of sched_info_on()	2019-07-25 15:51:55 +02:00
stop_task.c	sched/core: Further clarify sched_class::set_next_task()	2019-11-11 08:35:21 +01:00
swait.c	kernel/sched/: remove caller signal_pending branch predictions	2019-01-04 13:13:48 -08:00
topology.c	sched/topology: Assert non-NUMA topology masks don't (partially) overlap	2020-01-17 10:19:23 +01:00
wait_bit.c	sched/wait: fix ___wait_var_event(exclusive)	2019-12-17 13:32:50 +01:00
wait.c	Add wake_up_interruptible_sync_poll_locked()	2019-10-31 15:12:23 +00:00