mirror of
https://github.com/edk2-porting/linux-next.git
synced 2024-12-23 04:34:11 +08:00
24d0c1d6e6
2137 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Peter Zijlstra
|
24d0c1d6e6 |
sched/fair: Do not migrate due to a sync wakeup on exit
When a task exits, it notifies the parent that it has exited. This is a sync wakeup and the exiting task may pull the parent towards the wakers CPU. For simple workloads like using a shell, it was observed that the shell is pulled across nodes by exiting processes. This is daft as the parent may be long-lived and properly placed. This patch special cases a sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range of workloads and machines showed very little differences in performance although there was a small 3% boost on some machines running a shellscript intensive workload (git regression test suite). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
082f764a2f |
sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current CPU if there is a load imbalance. If the CPUs share LLC then either CPU is valid as a search-for-idle-sibling target and equally appropriate for stacking two tasks on one CPU if an idle sibling is unavailable. If they do not share cache then a cross-node migration potentially impacts locality so while they are equal from a CPU capacity point of view, they are not equal in terms of memory locality. In either case, it's more appropriate to migrate only if there is a difference in their effective load. This patch modifies wake_affine_weight() to only consider migrating a task if there is a load imbalance for normal wakeups but will allow potential stacking if the loads are equal and it's a sync wakeup. For the most part, the different in performance is marginal. For example, on a 4-socket server running netperf UDP_STREAM on localhost the differences are as follows: 4.15.0 4.15.0 16rc0 noequal-v1r23 Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%) Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%) Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%) Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%) Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%) Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%) Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%) Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%) Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%) Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%) Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%) Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%) Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%) Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%) Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%) Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%) Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%) Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%) Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%) Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%) The difference in overall performance is marginal but note that most measurements are less variable. There were similar observations for other netperf comparisons. hackbench with sockets or threads with processes or threads showed minor difference with some reduction of migration. tbench showed only marginal differences that were within the noise. dbench, regardless of filesystem, showed minor differences all of which are within noise. Multiple machines, both UMA and NUMA were tested without any regressions showing up. The biggest risk with a patch like this is affecting wakeup latencies. However, the schbench load from Facebook which is very sensitive to wakeup latency showed a mixed result with mostly improvements in wakeup latency: 4.15.0 4.15.0 16rc0 noequal-v1r23 Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%) Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%) Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%) Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%) Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%) Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%) Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%) Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%) Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%) Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%) Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%) Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%) Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%) Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%) Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%) Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%) Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%) Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%) Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%) Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%) Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%) Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%) Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%) Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%) Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%) Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%) Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%) Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%) Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%) Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%) Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%) Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%) Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%) Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%) Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%) Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%) Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%) Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%) Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%) Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%) Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
eeb6039863 |
sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU effective load may not be used so delay the calculation until it's needed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
7ebb66a12f |
sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of rechecking it. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Ingo Molnar
|
ed02934395 |
Linux 4.16-rc2
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqKKI0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRNAH/0v3+nuJ0oiHE1Cl fH89F9Ma17j8oTo28byRPi7X5XJfJAqANhHa209rguvnC27y3ew/l9k93HoxG12i ttvyKFDQulQbytfJZXw8lhUyYGXVsTpyNaihPe/NtqPdIxNgfrXsUN9EIEtcnuS2 SiAj51jUySDRNR4ST6TOx4ulDm1zLrmA28WHOBNOTvDi4jTQMt1TsngHfF5AySBB lD4RTRDDwWDWtdMI7euYSq019TiDXCxmwQ94vZjrqmjmSQcl/yCK/JzEV33SZslg 4WqGIllxONvP/UlwxZLaJ+RrslqxNgDVqQKwJdfYhGaWvpgPFtS1s86zW6IgyXny 02jJfD0= =DLWn -----END PGP SIGNATURE----- Merge tag 'v4.16-rc2' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Leo Yan
|
43d1b29b27 |
sched/cpufreq: Remove unused SUGOV_KTHREAD_PRIORITY macro
Since schedutil kernel thread directly set priority to 0, the macro SUGOV_KTHREAD_PRIORITY is not used. So remove it. Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/1518097702-9665-1-git-send-email-leo.yan@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
269d599271 |
sched/core: Fix DEBUG_SPINLOCK annotation for rq->lock
Mark noticed that he had sporadic "spinlock recursion" warnings from the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner changes in the middle of a context switch. It so happens that we fix up the lock.owner too late, @prev can run (remotely) the moment prev->on_cpu is cleared, this then allows @prev to again try and acquire this rq->lock and trigger this warning. So we have to switch lock.owner before clearing prev->on_cpu. Do this by moving the DEBUG_SPINLOCK annotation from after switch_to() to before switch_to() and collect all lockdep annotations there into prepare_lock_switch() to mirror the existing finish_lock_switch(). Debugged-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Wen Yang
|
a7711602c7 |
sched/rt: Make update_curr_rt() more accurate
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882008-44552-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Wen Yang
|
6fe0ce1eb0 |
sched/deadline: Make update_curr_dl() more accurate
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Vincent Guittot
|
387f77cc82 |
sched/fair: Remove stray space in #ifdef
Remove a useless space in # ifdef and align it with others. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
a2e5790d84 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ... |
||
Alexey Dobriyan
|
4de373a12f |
cpumask: make cpumask_size() return "unsigned int"
CPUmasks are never big enough to warrant 64-bit code. Space savings: add/remove: 0/0 grow/shrink: 1/4 up/down: 3/-17 (-14) Function old new delta sched_init_numa 1530 1533 +3 compat_sys_sched_setaffinity 160 159 -1 sys_sched_getaffinity 197 195 -2 sys_sched_setaffinity 183 176 -7 compat_sys_sched_getaffinity 179 172 -7 Link: http://lkml.kernel.org/r/20171204165531.GA8221@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Ingo Molnar
|
8284507916 |
Merge branch 'linus' into sched/urgent, to resolve conflicts
Conflicts: arch/arm64/kernel/entry.S arch/x86/Kconfig include/linux/sched/mm.h kernel/fork.c Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
32e839dda3 |
sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
The select_idle_sibling() (SIS) rewrite in commit:
|
||
Mel Gorman
|
806486c377 |
sched/fair: Do not migrate if the prev_cpu is idle
wake_affine_idle() prefers to move a task to the current CPU if the wakeup is due to an interrupt. The expectation is that the interrupt data is cache hot and relevant to the waking task as well as avoiding a search. However, there is no way to determine if there was cache hot data on the previous CPU that may exceed the interrupt data. Furthermore, round-robin delivery of interrupts can migrate tasks around a socket where each CPU is under-utilised. This can interact badly with cpufreq which makes decisions based on per-cpu data. It has been observed on machines with HWP that p-states are not boosted to their maximum levels even though the workload is latency and throughput sensitive. This patch uses the previous CPU for the task if it's idle and cache-affine with the current CPU even if the current CPU is idle due to the wakup being related to the interrupt. This reduces migrations at the cost of the interrupt data not being cache hot when the task wakes. A variety of workloads were tested on various machines and no adverse impact was noticed that was outside noise. dbench on ext4 on UMA showed roughly 10% reduction in the number of CPU migrations and it is a case where interrupts are frequent for IO competions. In most cases, the difference in performance is quite small but variability is often reduced. For example, this is the result for pgbench running on a UMA machine with different numbers of clients. 4.15.0-rc9 4.15.0-rc9 baseline waprev-v1 Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%) Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%) Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%) Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%) Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%) Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%) Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%) Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%) Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%) Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%) CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%) Note that the different in performance is marginal but for low utilisation, there is less variability. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
3b76c4a339 |
sched/fair: Restructure wake_affine*() to return a CPU id
This is a preparation patch that has wake_affine*() return a CPU ID instead of a boolean. The intent is to allow the wake_affine() helpers to be avoided if a decision is already made. This patch has no functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
89a55f56fd |
sched/fair: Remove unnecessary parameters from wake_affine_idle()
wake_affine_idle() takes parameters it never uses so clean it up. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Wen Yang
|
e7ad203166 |
sched/rt: Make update_curr_rt() more accurate
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Steven Rostedt (VMware)
|
364f566537 |
sched/rt: Up the root domain ref count when passing it around via IPIs
When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.
Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
Steven Rostedt (VMware)
|
ad0f1d9d65 |
sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.
Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:
container_of(work, struct root_domain, rto_push_work)
This keeps the root domain consistent.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
Peter Zijlstra
|
2ed41a5502 |
sched/core: Optimize update_stats_*()
These functions are already gated by schedstats_enabled(), there is no point in then issuing another static_branch for every individual update in them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
b85c8b71bf |
sched/core: Optimize ttwu_stat()
The whole of ttwu_stat() is guarded by a single schedstat_enabled(), there is absolutely no point in then issuing another static_branch for every single schedstat_inc() in there. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mathieu Desnoyers
|
70216e18e5 |
membarrier: Provide core serializing command, *_SYNC_CORE
Provide core serializing membarrier command to support memory reclaim by JIT. Each architecture needs to explicitly opt into that support by documenting in their architecture code how they provide the core serializing instructions required when returning from the membarrier IPI, and after the scheduler has updated the curr->mm pointer (before going back to user-space). They should then select ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on their architecture. Architectures selecting this feature need to either document that they issue core serializing instructions when returning to user-space, or implement their architecture-specific sync_core_before_usermode(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: David Sehr <sehr@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mathieu Desnoyers
|
c5f58bd58f |
membarrier: Provide GLOBAL_EXPEDITED command
Allow expedited membarrier to be used for data shared between processes through shared memory. Processes wishing to receive the membarriers register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED. This allows extremely simple kernel-level implementation: we have almost everything we need with the PRIVATE_EXPEDITED barrier code. All we need to do is to add a flag in the mm_struct that will be used to check whether we need to send the IPI to the current thread of each CPU. There is a slight downside to this approach compared to targeting specific shared memory users: when performing a membarrier operation, all registered "global" receivers will get the barrier, even if they don't share a memory mapping with the sender issuing MEMBARRIER_CMD_GLOBAL_EXPEDITED. This registration approach seems to fit the requirement of not disturbing processes that really deeply care about real-time: they simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED" command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward compatibility. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: David Sehr <sehr@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mathieu Desnoyers
|
306e060435 |
membarrier: Document scheduler barrier requirements
Document the membarrier requirement on having a full memory barrier in __schedule() after coming from user-space, before storing to rq->curr. It is provided by smp_mb__after_spinlock() in __schedule(). Document that membarrier requires a full barrier on transition from kernel thread to userspace thread. We currently have an implicit barrier from atomic_dec_and_test() in mmdrop() that ensures this. The x86 switch_mm_irqs_off() full barrier is currently provided by many cpumask update operations as well as write_cr3(). Document that write_cr3() provides this barrier. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: David Sehr <sehr@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mathieu Desnoyers
|
3ccfebedd8 |
powerpc, membarrier: Skip memory barrier in switch_mm()
Allow PowerPC to skip the full memory barrier in switch_mm(), and only issue the barrier when scheduling into a task belonging to a process that has registered to use expedited private. Threads targeting the same VM but which belong to different thread groups is a tricky case. It has a few consequences: It turns out that we cannot rely on get_nr_threads(p) to count the number of threads using a VM. We can use (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) instead to skip the synchronize_sched() for cases where the VM only has a single user, and that user only has a single thread. It also turns out that we cannot use for_each_thread() to set thread flags in all threads using a VM, as it only iterates on the thread group. Therefore, test the membarrier state variable directly rather than relying on thread flags. This means membarrier_register_private_expedited() needs to set the MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows private expedited membarrier commands to succeed. membarrier_arch_switch_mm() now tests for the MEMBARRIER_STATE_PRIVATE_EXPEDITED flag. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: David Sehr <sehr@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
ab486bc9a5 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek: - Add a console_msg_format command line option: The value "default" keeps the old "[time stamp] text\n" format. The value "syslog" allows to see the syslog-like "<log level>[timestamp] text" format. This feature was requested by people doing regression tests, for example, 0day robot. They want to have both filtered and full logs at hands. - Reduce the risk of softlockup: Pass the console owner in a busy loop. This is a new approach to the old problem. It was first proposed by Steven Rostedt on Kernel Summit 2017. It marks a context in which the console_lock owner calls console drivers and could not sleep. On the other side, printk() callers could detect this state and use a busy wait instead of a simple console_trylock(). Finally, the console_lock owner checks if there is a busy waiter at the end of the special context and eventually passes the console_lock to the waiter. The hand-off works surprisingly well and helps in many situations. Well, there is still a possibility of the softlockup, for example, when the flood of messages stops and the last owner still has too much to flush. There is increasing number of people having problems with printk-related softlockups. We might eventually need to get better solution. Anyway, this looks like a good start and promising direction. - Do not allow to schedule in console_unlock() called from printk(): This reverts an older controversial commit. The reschedule helped to avoid softlockups. But it also slowed down the console output. This patch is obsoleted by the new console waiter logic described above. In fact, the reschedule made the hand-off less effective. - Deprecate "%pf" and "%pF" format specifier: It was needed on ia64, ppc64 and parisc64 to dereference function descriptors and show the real function address. It is done transparently by "%ps" and "pS" format specifier now. Sergey Senozhatsky found that all the function descriptors were in a special elf section and could be easily detected. - Remove printk_symbol() API: It has been obsoleted by "%pS" format specifier, and this change helped to remove few continuous lines and a less intuitive old API. - Remove redundant memsets: Sergey removed unnecessary memset when processing printk.devkmsg command line option. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits) printk: drop redundant devkmsg_log_str memsets printk: Never set console_may_schedule in console_trylock() printk: Hide console waiter logic into helpers printk: Add console owner and waiter logic to load balance console writes kallsyms: remove print_symbol() function checkpatch: add pF/pf deprecation warning symbol lookup: introduce dereference_symbol_descriptor() parisc64: Add .opd based function descriptor dereference powerpc64: Add .opd based function descriptor dereference ia64: Add .opd based function descriptor dereference sections: split dereference_function_descriptor() openrisc: Fix conflicting types for _exext and _stext lib: do not use print_symbol() irq debug: do not use print_symbol() sysfs: do not use print_symbol() drivers: do not use print_symbol() x86: do not use print_symbol() unicore32: do not use print_symbol() sh: do not use print_symbol() mn10300: do not use print_symbol() ... |
||
Linus Torvalds
|
af8c5e2d60 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Implement frequency/CPU invariance and OPP selection for SCHED_DEADLINE (Juri Lelli) - Tweak the task migration logic for better multi-tasking workload scalability (Mel Gorman) - Misc cleanups, fixes and improvements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Make bandwidth enforcement scale-invariant sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter sched/cpufreq: Always consider all CPUs when deciding next freq sched/cpufreq: Split utilization signals sched/cpufreq: Change the worker kthread to SCHED_DEADLINE sched/deadline: Move CPU frequency selection triggering points sched/cpufreq: Use the DEADLINE utilization signal sched/deadline: Implement "runtime overrun signal" support sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache sched/fair: Correct obsolete comment about cpufreq_update_util() sched/fair: Remove impossible condition from find_idlest_group_cpu() sched/cpufreq: Don't pass flags to sugov_set_iowait_boost() sched/cpufreq: Initialize sg_cpu->flags to 0 sched/fair: Consider RT/IRQ pressure in capacity_spare_wake() sched/fair: Use 'unsigned long' for utilization, consistently sched/core: Rework and clarify prepare_lock_switch() sched/fair: Remove unused 'curr' parameter from wakeup_gran sched/headers: Constify object_is_on_stack() |
||
Linus Torvalds
|
d772794637 |
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle were: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Torture-test updates. - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) torture: Save a line in stutter_wait(): while -> for torture: Eliminate torture_runnable and perf_runnable torture: Make stutter less vulnerable to compilers and races locking/locktorture: Fix num reader/writer corner cases locking/locktorture: Fix rwsem reader_delay torture: Place all torture-test modules in one MAINTAINERS group rcutorture/kvm-build.sh: Skip build directory check rcutorture: Simplify functions.sh include path rcutorture: Simplify logging rcutorture/kvm-recheck-*: Improve result directory readability check rcutorture/kvm.sh: Support execution from any directory rcutorture/kvm.sh: Use consistent help text for --qemu-args rcutorture/kvm.sh: Remove unused variable, `alldone` rcutorture: Remove unused script, config2frag.sh rcutorture/configinit: Fix build directory error message rcutorture: Preempt RCU-preempt readers more vigorously torture: Reduce #ifdefs for preempt_schedule() rcu: Remove have_rcu_nocb_mask from tree_plugin.h rcu: Add comment giving debug strategy for double call_rcu() tracing, rcu: Hide trace event rcu_nocb_wake when not used ... |
||
Peter Zijlstra
|
ce48c14649 |
sched/core: Fix cpu.max vs. cpuhotplug deadlock
Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion: tg_set_cfs_bandwidth() get_online_cpus() cpus_read_lock() cfs_bandwidth_usage_inc() static_key_slow_inc() cpus_read_lock() Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Josh Snyder
|
c96f5471ce |
delayacct: Account blkio completion on the correct task
Before commit: |
||
Linus Torvalds
|
67549d46d4 |
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar: "A Kconfig fix, a build fix and a membarrier bug fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: membarrier: Disable preemption when calling smp_call_function_many() sched/isolation: Make CONFIG_CPU_ISOLATION=y depend on SMP or COMPILE_TEST ia64, sched/cputime: Fix build error if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y |
||
Juri Lelli
|
07881166a8 |
sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth enforcement (similar to what we already do to fair utilization tracking). Each delta_exec gets scaled considering current frequency and maximum CPU capacity; which means that the reservation runtime parameter (that need to be specified profiling the task execution at max frequency on biggest capacity core) gets thus scaled accordingly. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
7e1a9208f6 |
sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
Currently, frequency and cpu capacity scaling is only performed on CONFIG_SMP systems (as CFS PELT signals are only present for such systems). However, other scheduling classes want to do freq/cpu scaling, and for !CONFIG_SMP configurations as well. arch_scale_freq_capacity() is useful to implement frequency scaling even on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP ifdeffery. Even if arch_scale_cpu_capacity() is not useful on !CONFIG_SMP platforms, we make a default implementation available for such configurations anyway to simplify scheduler code doing CPU scale invariance. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-8-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
7673c8a4c7 |
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to see where information coming from scheduling domains might help doing frequency invariance scaling). Remove it; also in anticipation of moving arch_scale_freq_capacity() outside CONFIG_SMP. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
0fa7d181f1 |
sched/cpufreq: Always consider all CPUs when deciding next freq
No assumption can be made upon the rate at which frequency updates get triggered, as there are scheduling policies (like SCHED_DEADLINE) which don't trigger them so frequently. Remove such assumption from the code, by always considering SCHED_DEADLINE utilization signal as not stale. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-6-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
d18be45dbf |
sched/cpufreq: Split utilization signals
To be able to treat utilization signals of different scheduling classes in different ways (e.g., CFS signal might be stale while DEADLINE signal is never stale by design) we need to split sugov_cpu::util signal in two: util_cfs and util_dl. This patch does that by also changing sugov_get_util() parameter list. After this change, aggregation of the different signals has to be performed by sugov_get_util() users (so that they can decide what to do with the different signals). Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-5-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
794a56ebd9 |
sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
Worker kthread needs to be able to change frequency for all other threads. Make it special, just under STOP class. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
e0367b1267 |
sched/deadline: Move CPU frequency selection triggering points
Since SCHED_DEADLINE doesn't track utilization signal (but reserves a fraction of CPU bandwidth to tasks admitted to the system), there is no point in evaluating frequency changes during each tick event. Move frequency selection triggering points to where running_bw changes. Co-authored-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-3-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
d4edd662ac |
sched/cpufreq: Use the DEADLINE utilization signal
SCHED_DEADLINE tracks active utilization signal with a per dl_rq variable named running_bw. Make use of that to drive CPU frequency selection: add up FAIR and DEADLINE contribution to get the required CPU capacity to handle both requirements (while RT still selects max frequency). Co-authored-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-2-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
34be39305a |
sched/deadline: Implement "runtime overrun signal" support
This patch adds the possibility of getting the delivery of a SIGXCPU signal whenever there is a runtime overrun. The request is done through the sched_flags field within the sched_attr structure. Forward port of https://lkml.org/lkml/2009/10/16/170 Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
7332dec055 |
sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
If waking from an idle CPU due to an interrupt then it's possible that the waker task will be pulled to wake on the current CPU. Unfortunately, depending on the type of interrupt and IRQ configuration, there may not be a strong relationship between the CPU an interrupt was delivered on and the CPU a task was running on. For example, the interrupts could all be delivered to CPUs on one particular node due to the machine topology or IRQ affinity configuration. Another example is an interrupt for an IO completion which can be delivered to any CPU where there is no guarantee the data is either cache hot or even local. This patch was motivated by the observation that an IO workload was being pulled cross-node on a frequent basis when IO completed. From a wakeup latency perspective, it's still useful to know that an idle CPU is immediately available for use but lets only consider an automatic migration if the CPUs share cache to limit damage due to NUMA migrations. Migrations may still occur if wake_affine_weight determines it's appropriate. These are the throughput results for dbench running on ext4 comparing 4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO completions can happen on any CPU. 4.15.0-rc3 4.15.0-rc3 vanilla lessmigrate Hmean 1 854.64 ( 0.00%) 865.01 ( 1.21%) Hmean 2 1229.60 ( 0.00%) 1274.44 ( 3.65%) Hmean 4 1591.81 ( 0.00%) 1628.08 ( 2.28%) Hmean 8 1845.04 ( 0.00%) 1831.80 ( -0.72%) Hmean 16 2038.61 ( 0.00%) 2091.44 ( 2.59%) Hmean 32 2327.19 ( 0.00%) 2430.29 ( 4.43%) Hmean 64 2570.61 ( 0.00%) 2568.54 ( -0.08%) Hmean 128 2481.89 ( 0.00%) 2499.28 ( 0.70%) Stddev 1 14.31 ( 0.00%) 5.35 ( 62.65%) Stddev 2 21.29 ( 0.00%) 11.09 ( 47.92%) Stddev 4 7.22 ( 0.00%) 6.80 ( 5.92%) Stddev 8 26.70 ( 0.00%) 9.41 ( 64.76%) Stddev 16 22.40 ( 0.00%) 20.01 ( 10.70%) Stddev 32 45.13 ( 0.00%) 44.74 ( 0.85%) Stddev 64 93.10 ( 0.00%) 93.18 ( -0.09%) Stddev 128 184.28 ( 0.00%) 177.85 ( 3.49%) Note the small increase in throughput for low thread counts but also note that the standard deviation for each sample during the test run is lower. The throughput figures for dbench can be misleading so the benchmark is actually modified to time the latency of the processing of one load file with many samples taken. The difference in latency is 4.15.0-rc3 4.15.0-rc3 vanilla lessmigrate Amean 1 21.71 ( 0.00%) 21.47 ( 1.08%) Amean 2 30.89 ( 0.00%) 29.58 ( 4.26%) Amean 4 47.54 ( 0.00%) 46.61 ( 1.97%) Amean 8 82.71 ( 0.00%) 82.81 ( -0.12%) Amean 16 149.45 ( 0.00%) 145.01 ( 2.97%) Amean 32 265.49 ( 0.00%) 248.43 ( 6.42%) Amean 64 463.23 ( 0.00%) 463.55 ( -0.07%) Amean 128 933.97 ( 0.00%) 935.50 ( -0.16%) Stddev 1 1.58 ( 0.00%) 1.54 ( 2.26%) Stddev 2 2.84 ( 0.00%) 2.95 ( -4.15%) Stddev 4 6.78 ( 0.00%) 6.85 ( -0.99%) Stddev 8 16.85 ( 0.00%) 16.37 ( 2.85%) Stddev 16 41.59 ( 0.00%) 41.04 ( 1.32%) Stddev 32 111.05 ( 0.00%) 105.11 ( 5.35%) Stddev 64 285.94 ( 0.00%) 288.01 ( -0.72%) Stddev 128 803.39 ( 0.00%) 809.73 ( -0.79%) It's a small improvement which is not surprising given that migrations that migrate to a different node as not that common. However, it is noticeable in the CPU migration statistics which are reduced by 24%. There was a query for v1 of this patch about NAS so here are the results for C-class using MPI for parallelisation on the same machine nas-mpi 4.15.0-rc3 4.15.0-rc3 vanilla noirq Time cg.C 24.25 ( 0.00%) 23.17 ( 4.45%) Time ep.C 8.22 ( 0.00%) 8.29 ( -0.85%) Time ft.C 22.67 ( 0.00%) 20.34 ( 10.28%) Time is.C 1.42 ( 0.00%) 1.47 ( -3.52%) Time lu.C 55.62 ( 0.00%) 54.81 ( 1.46%) Time mg.C 7.93 ( 0.00%) 7.91 ( 0.25%) 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 User 3799.96 3748.34 System 672.10 626.15 Elapsed 91.91 79.49 lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small regressions but in terms of absolute time, the difference is small and likely within run-to-run variance. System CPU usage is slightly reduced. schbench from Facebook was also requested. This is a bit of a mixed bag but it's important to note that this workload should not be heavily impacted by wakeups from interrupt context. 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%) Lat 90.00th-qrtle-1 43.00 ( 0.00%) 44.00 ( -2.33%) Lat 95.00th-qrtle-1 44.00 ( 0.00%) 46.00 ( -4.55%) Lat 99.00th-qrtle-1 57.00 ( 0.00%) 58.00 ( -1.75%) Lat 99.50th-qrtle-1 59.00 ( 0.00%) 59.00 ( 0.00%) Lat 99.90th-qrtle-1 67.00 ( 0.00%) 78.00 ( -16.42%) Lat 50.00th-qrtle-2 40.00 ( 0.00%) 51.00 ( -27.50%) Lat 75.00th-qrtle-2 45.00 ( 0.00%) 56.00 ( -24.44%) Lat 90.00th-qrtle-2 53.00 ( 0.00%) 59.00 ( -11.32%) Lat 95.00th-qrtle-2 57.00 ( 0.00%) 61.00 ( -7.02%) Lat 99.00th-qrtle-2 67.00 ( 0.00%) 71.00 ( -5.97%) Lat 99.50th-qrtle-2 69.00 ( 0.00%) 74.00 ( -7.25%) Lat 99.90th-qrtle-2 83.00 ( 0.00%) 77.00 ( 7.23%) Lat 50.00th-qrtle-4 51.00 ( 0.00%) 51.00 ( 0.00%) Lat 75.00th-qrtle-4 57.00 ( 0.00%) 56.00 ( 1.75%) Lat 90.00th-qrtle-4 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%) Lat 99.00th-qrtle-4 73.00 ( 0.00%) 72.00 ( 1.37%) Lat 99.50th-qrtle-4 76.00 ( 0.00%) 74.00 ( 2.63%) Lat 99.90th-qrtle-4 85.00 ( 0.00%) 78.00 ( 8.24%) Lat 50.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%) Lat 75.00th-qrtle-8 59.00 ( 0.00%) 62.00 ( -5.08%) Lat 90.00th-qrtle-8 65.00 ( 0.00%) 66.00 ( -1.54%) Lat 95.00th-qrtle-8 67.00 ( 0.00%) 70.00 ( -4.48%) Lat 99.00th-qrtle-8 78.00 ( 0.00%) 79.00 ( -1.28%) Lat 99.50th-qrtle-8 81.00 ( 0.00%) 80.00 ( 1.23%) Lat 99.90th-qrtle-8 116.00 ( 0.00%) 83.00 ( 28.45%) Lat 50.00th-qrtle-16 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 75.00th-qrtle-16 77.00 ( 0.00%) 71.00 ( 7.79%) Lat 90.00th-qrtle-16 83.00 ( 0.00%) 82.00 ( 1.20%) Lat 95.00th-qrtle-16 87.00 ( 0.00%) 87.00 ( 0.00%) Lat 99.00th-qrtle-16 95.00 ( 0.00%) 96.00 ( -1.05%) Lat 99.50th-qrtle-16 99.00 ( 0.00%) 103.00 ( -4.04%) Lat 99.90th-qrtle-16 104.00 ( 0.00%) 122.00 ( -17.31%) Lat 50.00th-qrtle-32 71.00 ( 0.00%) 73.00 ( -2.82%) Lat 75.00th-qrtle-32 91.00 ( 0.00%) 92.00 ( -1.10%) Lat 90.00th-qrtle-32 108.00 ( 0.00%) 107.00 ( 0.93%) Lat 95.00th-qrtle-32 118.00 ( 0.00%) 115.00 ( 2.54%) Lat 99.00th-qrtle-32 134.00 ( 0.00%) 129.00 ( 3.73%) Lat 99.50th-qrtle-32 138.00 ( 0.00%) 133.00 ( 3.62%) Lat 99.90th-qrtle-32 149.00 ( 0.00%) 146.00 ( 2.01%) Lat 50.00th-qrtle-39 83.00 ( 0.00%) 81.00 ( 2.41%) Lat 75.00th-qrtle-39 105.00 ( 0.00%) 102.00 ( 2.86%) Lat 90.00th-qrtle-39 120.00 ( 0.00%) 119.00 ( 0.83%) Lat 95.00th-qrtle-39 129.00 ( 0.00%) 128.00 ( 0.78%) Lat 99.00th-qrtle-39 153.00 ( 0.00%) 149.00 ( 2.61%) Lat 99.50th-qrtle-39 166.00 ( 0.00%) 156.00 ( 6.02%) Lat 99.90th-qrtle-39 12304.00 ( 0.00%) 12848.00 ( -4.42%) When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there are small gains in many cases. Otherwise it depends on the quartile used where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results are probably a co-incidence. For this workload, much depends on what node the threads get placed on and their relative locality and not wakeups from interrupt context. A larger component on how it behaves would be automatic NUMA balancing where a fault incurred to measure locality would be a much larger contributer to latency than the wakeup path. This is the results from an almost identical machine that happened to run the same test. They only differ in terms of storage which is irrelevant for this test. 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%) Lat 90.00th-qrtle-1 44.00 ( 0.00%) 43.00 ( 2.27%) Lat 95.00th-qrtle-1 53.00 ( 0.00%) 45.00 ( 15.09%) Lat 99.00th-qrtle-1 59.00 ( 0.00%) 58.00 ( 1.69%) Lat 99.50th-qrtle-1 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 99.90th-qrtle-1 86.00 ( 0.00%) 61.00 ( 29.07%) Lat 50.00th-qrtle-2 52.00 ( 0.00%) 41.00 ( 21.15%) Lat 75.00th-qrtle-2 57.00 ( 0.00%) 46.00 ( 19.30%) Lat 90.00th-qrtle-2 60.00 ( 0.00%) 53.00 ( 11.67%) Lat 95.00th-qrtle-2 62.00 ( 0.00%) 57.00 ( 8.06%) Lat 99.00th-qrtle-2 73.00 ( 0.00%) 68.00 ( 6.85%) Lat 99.50th-qrtle-2 74.00 ( 0.00%) 71.00 ( 4.05%) Lat 99.90th-qrtle-2 90.00 ( 0.00%) 75.00 ( 16.67%) Lat 50.00th-qrtle-4 57.00 ( 0.00%) 52.00 ( 8.77%) Lat 75.00th-qrtle-4 60.00 ( 0.00%) 58.00 ( 3.33%) Lat 90.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%) Lat 95.00th-qrtle-4 65.00 ( 0.00%) 65.00 ( 0.00%) Lat 99.00th-qrtle-4 76.00 ( 0.00%) 75.00 ( 1.32%) Lat 99.50th-qrtle-4 77.00 ( 0.00%) 77.00 ( 0.00%) Lat 99.90th-qrtle-4 87.00 ( 0.00%) 81.00 ( 6.90%) Lat 50.00th-qrtle-8 59.00 ( 0.00%) 57.00 ( 3.39%) Lat 75.00th-qrtle-8 63.00 ( 0.00%) 62.00 ( 1.59%) Lat 90.00th-qrtle-8 66.00 ( 0.00%) 67.00 ( -1.52%) Lat 95.00th-qrtle-8 68.00 ( 0.00%) 70.00 ( -2.94%) Lat 99.00th-qrtle-8 79.00 ( 0.00%) 80.00 ( -1.27%) Lat 99.50th-qrtle-8 80.00 ( 0.00%) 84.00 ( -5.00%) Lat 99.90th-qrtle-8 84.00 ( 0.00%) 90.00 ( -7.14%) Lat 50.00th-qrtle-16 65.00 ( 0.00%) 65.00 ( 0.00%) Lat 75.00th-qrtle-16 77.00 ( 0.00%) 75.00 ( 2.60%) Lat 90.00th-qrtle-16 84.00 ( 0.00%) 83.00 ( 1.19%) Lat 95.00th-qrtle-16 88.00 ( 0.00%) 87.00 ( 1.14%) Lat 99.00th-qrtle-16 97.00 ( 0.00%) 96.00 ( 1.03%) Lat 99.50th-qrtle-16 100.00 ( 0.00%) 104.00 ( -4.00%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 126.00 ( -14.55%) Lat 50.00th-qrtle-32 70.00 ( 0.00%) 71.00 ( -1.43%) Lat 75.00th-qrtle-32 92.00 ( 0.00%) 94.00 ( -2.17%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 95.00th-qrtle-32 121.00 ( 0.00%) 118.00 ( 2.48%) Lat 99.00th-qrtle-32 135.00 ( 0.00%) 137.00 ( -1.48%) Lat 99.50th-qrtle-32 140.00 ( 0.00%) 146.00 ( -4.29%) Lat 99.90th-qrtle-32 150.00 ( 0.00%) 160.00 ( -6.67%) Lat 50.00th-qrtle-39 80.00 ( 0.00%) 71.00 ( 11.25%) Lat 75.00th-qrtle-39 102.00 ( 0.00%) 91.00 ( 10.78%) Lat 90.00th-qrtle-39 118.00 ( 0.00%) 108.00 ( 8.47%) Lat 95.00th-qrtle-39 128.00 ( 0.00%) 117.00 ( 8.59%) Lat 99.00th-qrtle-39 149.00 ( 0.00%) 133.00 ( 10.74%) Lat 99.50th-qrtle-39 160.00 ( 0.00%) 139.00 ( 13.12%) Lat 99.90th-qrtle-39 13808.00 ( 0.00%) 4920.00 ( 64.37%) Despite being nearly identical, it showed a variety of major gains so I'm not convinced that heavy emphasis should be placed on this particular workload in terms of evaluating this particular patch. Further evidence of this is the fact that testing on a UMA machine showed small gains/losses even though the patch should be a no-op on UMA. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
9783be2c0e |
sched/fair: Correct obsolete comment about cpufreq_update_util()
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
18cec7e0dd |
sched/fair: Remove impossible condition from find_idlest_group_cpu()
find_idlest_group_cpu() goes through CPUs of a group previous selected by find_idlest_group(). find_idlest_group() returns NULL if the local group is the selected one and doesn't execute find_idlest_group_cpu if the group to which 'cpu' belongs to is chosen. So we're always guaranteed to call find_idlest_group_cpu() with a group to which 'cpu' is non-local. This makes one of the conditions in find_idlest_group_cpu() an impossible one, which we can get rid off. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Viresh Kumar
|
5083452f8c |
sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
We are already passing sg_cpu as argument to sugov_set_iowait_boost() helper and the same can be used to retrieve the flags value. Get rid of the redundant argument. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: tkjos@android.com Link: http://lkml.kernel.org/r/4ec5562b1a87e146ebab11fb5dde1ca9c763a7fb.1513158452.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Viresh Kumar
|
6257e70478 |
sched/cpufreq: Initialize sg_cpu->flags to 0
Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit. The flags field wouldn't be used until the utilization update handler is called for the first time, and once that is called we will overwrite flags anyway. Initialize it to 0. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: morten.rasmussen@arm.com Cc: tkjos@android.com Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
f453ae2200 |
sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
capacity_spare_wake() in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake(). Tests results from improvements with this change are below. More tests were also done by myself and Matt Fleming to ensure no degradation in different benchmarks. 1) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ |Threads | Without patch | With patch | | | | | +--------+--------+---------+-----------------+---------+ | | Mean | Std Dev | Mean | Std Dev | +--------+--------+---------+-----------------+---------+ |1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 | |2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 | |4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 | |8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 | |16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 | |32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 | |64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 | +--------+--------+---------+-----------------+---------+ 2) ping+hackbench test on bare-metal sever (by Rohit) ----------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ |Task Groups | Without patch | With patch | | +-------+---------+----------------+---------+ |(Groups of 40)| Mean | Std Dev | Mean | Std Dev | +--------------+-------+---------+----------------+---------+ |1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 | |2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 | |4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 | |8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 | |16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 | |25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 | +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran several different hackbench tests and cyclic test to santiy-check that the patch doesn't harm other usecases. Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Tested-by: Rohit Jain <rohit.k.jain@oracle.com> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Brendan Jackman <brendan.jackman@arm.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Patrick Bellasi
|
f01415fdbf |
sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long', however some functions using them return an 'int' which is ultimately assigned back to 'unsigned long' variables. Since there is not scope on using a different and signed type, consolidate the signature of functions returning utilization to always use the native type. This change improves code consistency, and it also benefits code paths where utilizations should be clamped by avoiding further type conversions or ugly type casts. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chris Redpath <chris.redpath@arm.com> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@android.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
rodrigosiqueira
|
31cb1bc0dc |
sched/core: Rework and clarify prepare_lock_switch()
The prepare_lock_switch() function has an unused parameter, and also the function name was not descriptive. To improve readability and remove the extra parameter, do the following changes: * Move prepare_lock_switch() from kernel/sched/sched.h to kernel/sched/core.c, rename it to prepare_task(), and remove the unused parameter. * Split the smp_store_release() out from finish_lock_switch() to a function named finish_task. * Comments ajdustments. Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Ingo Molnar
|
cb1f34ddcc |
Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |