Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
The affinity of managed interrupts is completely handled in the kernel and
cannot be changed via the /proc/irq/* interfaces from user space. As the
kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent
vector exhaustion, it can happen that a managed interrupt whose affinity
mask contains both isolated and housekeeping CPUs is routed to an isolated
CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
on the isolated CPU.
Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
logic in the interrupt affinity selection code.
The subparameter indicates to the interrupt affinity selection logic that
it should try to avoid the above scenario.
This isolation is best effort and only effective if the automatically
assigned interrupt mask of a device queue contains isolated and
housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
directed to the housekeeping CPU so that IO submitted on the housekeeping
CPU cannot disturb the isolated CPU.
If a queue's affinity mask contains only isolated CPUs then this parameter
has no effect on the interrupt routing decision, though interrupts are only
happening when tasks running on those isolated CPUs submit IO. IO submitted
on housekeeping CPUs has no influence on those queues.
If the affinity mask contains both housekeeping and isolated CPUs, but none
of the contained housekeeping CPUs is online, then the interrupt is also
routed to an isolated CPU. Interrupts are only delivered when one of the
isolated CPUs in the affinity mask submits IO. If one of the contained
housekeeping CPUs comes online, the CPU hotplug logic migrates the
interrupt automatically back to the upcoming housekeeping CPU. Depending on
the type of interrupt controller, this can require that at least one
interrupt is delivered to the isolated CPU in order to complete the
migration.
[ tglx: Removed unused parameter, added and edited comments/documentation
and rephrased the changelog so it contains more details. ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
sched_idle_cpu() isn't used for non SMP configuration and with a recent
change, we have started getting following warning:
kernel/sched/fair.c:5221:12: warning: ‘sched_idle_cpu’ defined but not used [-Wunused-function]
Fix that by defining sched_idle_cpu() only for SMP configurations.
Fixes: 323af6deaf ("sched/fair: Load balance aggressively for SCHED_IDLE CPUs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/f0554f590687478b33914a4aff9f0e6a62886d44.1579499907.git.viresh.kumar@linaro.org
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.
In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7
The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:
3 -> 4 -> 5 -> 6 -> 7
^ /
`----------------'
And a later pass through get_group(4, sdd@MC) will "corrupt" that to:
3 -> 4 -> 5 -> 6 -> 7
^ /
`-----------'
which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.
Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
Testing
=======
Dietmar managed to reproduce this using the following qemu incantation:
$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1
alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):
8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
+ if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+ continue;
+
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
8<---
[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
There is a spelling misake in comments of cpuidle_idle_call. Fix it.
Signed-off-by: Hewenliang <hewenliang4@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20200110025604.34373-1-hewenliang4@huawei.com
With commit
bef69dd878 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()")
update_load_avg() has become the central point for calling cpufreq
(not including the update of blocked load). This change helps to
simplify further the number of calls to cpufreq_update_util() and to
remove last redundant ones. With update_load_avg(), we are now sure
that cpufreq_update_util() will be called after every task attachment
to a cfs_rq and especially after propagating this event down to the
util_avg of the root cfs_rq, which is the level that is used by
cpufreq governors like schedutil to set the frequency of a CPU.
The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when
the migration happens in a cgroup whereas util_avg of root cfs_rq is
not yet updated and this call is duplicated with the one that happens
immediately after when the migration event reaches the root cfs_rq.
The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be
removed. The interface of attach_entity_load_avg() can also be
simplified accordingly.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0,
I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.
In the future, user maybe determine whether the psi feature is enabled by
checking the existence of the /proc/pressure dir or
/proc/pressure/{io|memory|cpu} files.
Signed-off-by: Wang Long <w@laoqinren.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
commit bf475ce0a3 ("sched/fair: Add per-CPU min capacity to
sched_group_capacity") introduced per-cpu min_capacity.
commit e3d6d0cb66 ("sched/fair: Add sched_group per-CPU max capacity")
introduced per-cpu max_capacity.
In the SD_OVERLAP case, the local variable 'capacity' represents the sum
of CPU capacity of all CPUs in the first sched group (sg) of the sched
domain (sd).
It is erroneously used to calculate sg's min and max CPU capacity.
To fix this use capacity_of(cpu) instead of 'capacity'.
The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity
(for rq->sd != NULL) can be removed since it delivers the same value as
capacity_of(cpu) which is currently only used for the (!rq->sd) case
(see update_cpu_capacity()).
An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single
CPU (and hence sg->sgc->capacity == capacity_of(cpu)).
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
Move the code of calculation for delta_sum/delta_avg to where
it is really needed to be done.
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200103114400.17668-1-rocking@linux.alibaba.com
Every time we call irqtime_account_process_tick() is in a interrupt,
Every caller will get and assign a parameter rq = this_rq(), This is
unnecessary and increase the code size a little bit. Move the rq getting
action to irqtime_account_process_tick internally is better.
base with this patch
cputime.o 578792 bytes 577888 bytes
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1577959674-255537-1-git-send-email-alex.shi@linux.alibaba.com
Lengthy output of sysrq-t may take a lot of time on slow serial console
with lots of processes and CPUs.
So we need to reset NMI-watchdog to avoid spurious lockup messages, and
we also reset softlockup watchdogs on all other CPUs since another CPU
might be blocked waiting for us to process an IPI or stop_machine.
Add to sysrq_sched_debug_show() as what we did in show_state_filter().
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.
Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcountinga")
Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.
Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Fixes: 0b60ba2dd3 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
The fair scheduler performs periodic load balance on every CPU to check
if it can pull some tasks from other busy CPUs. The duration of this
periodic load balance is set to sd->balance_interval for the idle CPUs
and is calculated by multiplying the sd->balance_interval with the
sd->busy_factor (set to 32 by default) for the busy CPUs. The
multiplication is done for busy CPUs to avoid doing load balance too
often and rather spend more time executing actual task. While that is
the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
tasks.
With the recent enhancements in the fair scheduler around SCHED_IDLE
CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
CPU instead of other busy or idle CPUs. The same reasoning should be
applied to the load balancer as well to make it migrate tasks more
aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
latency of the migrated (SCHED_OTHER) tasks.
This patch makes minimal changes to the fair scheduler to do the next
load balance soon after the last non SCHED_IDLE task is dequeued from a
runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is
ignored while calculating the balance_interval for such CPUs. This is
done to avoid delaying the periodic load balance by few hundred
milliseconds for SCHED_IDLE CPUs.
This is tested on ARM64 Hikey620 platform (octa-core) with the help of
rt-app and it is verified, using kernel traces, that the newly
SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE
and pulls tasks from other busy CPUs.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
Similarly to calculate_imbalance() and find_busiest_group(), using the
number of idle CPUs when there is only 1 CPU in the group is not efficient
because we can't make a difference between a CPU running 1 task and a CPU
running dozens of small tasks competing for the same CPU but not enough
to overload it. More generally speaking, we should use the number of
running tasks when there is the same number of idle CPUs in a group instead
of blindly select the 1st one.
When the groups have spare capacity and the same number of idle CPUs, we
compare the number of running tasks to select the busiest group.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1576839893-26930-1-git-send-email-vincent.guittot@linaro.org
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.
Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
[ mingo: Minor edits to comments. ]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_fits_capacity() has just been made uclamp-aware, and
find_energy_efficient_cpu() needs to go through the same treatment.
Things are somewhat different here however - using the task max clamp isn't
sufficient. Consider the following setup:
The target runqueue, rq:
rq.cpu_capacity_orig = 512
rq.cfs.avg.util_avg = 200
rq.uclamp.max = 768 // the max p.uclamp.max of all enqueued p's is 768
The waking task, p (not yet enqueued on rq):
p.util_est = 600
p.uclamp.max = 100
Now, consider the following code which doesn't use the rq clamps:
util = uclamp_task_util(p);
// Does the task fit in the spare CPU capacity?
cpu = cpu_of(rq);
fits_capacity(util, cpu_capacity(cpu) - cpu_util(cpu))
This would lead to:
util = 100;
fits_capacity(100, 512 - 200)
fits_capacity() would return true. However, enqueuing p on that CPU *will*
cause it to become overutilized since rq clamp values are max-aggregated,
so we'd remain with
rq.uclamp.max = 768
which comes from the other tasks already enqueued on rq. Thus, we could
select a high enough frequency to reach beyond 0.8 * 512 utilization
(== overutilized) after enqueuing p on rq. What find_energy_efficient_cpu()
needs here is uclamp_rq_util_with() which lets us peek at the future
utilization landscape, including rq-wide uclamp values.
Make find_energy_efficient_cpu() use uclamp_rq_util_with() for its
fits_capacity() check. This is in line with what compute_energy() ends up
using for estimating utilization.
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-6-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_fits_capacity() drives CPU selection at wakeup time, and is also used
to detect misfit tasks. Right now it does so by comparing task_util_est()
with a CPU's capacity, but doesn't take into account uclamp restrictions.
There's a few interesting uses that can come out of doing this. For
instance, a low uclamp.max value could prevent certain tasks from being
flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
Similarly, a high uclamp.min value would steer tasks towards high capacity
CPUs at wakeup (and, should that fail, later steered via misfit balancing),
so such "boosted" tasks would favor CPUs of higher capacity.
Introduce uclamp_task_util() and make task_fits_capacity() use it.
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current helper returns (CPU) rq utilization with uclamp restrictions
taken into account. A uclamp task utilization helper would be quite
helpful, but this requires some renaming.
Prepare the code for the introduction of a uclamp_task_util() by renaming
the existing uclamp_util_with() to uclamp_rq_util_with().
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent pointed out recently that the canonical type for utilization
values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
cache optimization, but this doesn't have to be exported to its users.
Make the uclamp helpers that deal with utilization use and return unsigned
long values.
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
uclamp_util_with() instead in commit:
af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
From then on, uclamp_util() has remained unused. Being a simple wrapper
around uclamp_util_with(), we can get rid of it and win back a few lines.
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are instances where we keep searching for an idle CPU despite
already having a sched-idle CPU (in find_idlest_group_cpu(),
select_idle_smt() and select_idle_cpu() and then there are places where
we don't necessarily do that and return a sched-idle CPU as soon as we
find one (in select_idle_sibling()). This looks a bit inconsistent and
it may be worth having the same policy everywhere.
On the other hand, choosing a sched-idle CPU over a idle one shall be
beneficial from performance and power point of view as well, as we don't
need to get the CPU online from a deep idle state which wastes quite a
lot of time and energy and delays the scheduling of the newly woken up
task.
This patch tries to simplify code around sched-idle CPU selection and
make it consistent throughout.
Testing is done with the help of rt-app on hikey board (ARM64 octa-core,
2 clusters, 0-3 and 4-7). The cpufreq governor was set to performance to
avoid any side affects from CPU frequency. Following are the tests
performed:
Test 1: 1-cfs-task:
A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us
out of 7777 us (so gives time for the cluster to go in deep idle
state).
Test 2: 1-cfs-1-idle-task:
A single SCHED_NORMAL task is pinned on CPU5 and single SCHED_IDLE
task is pinned on CPU6 (to make sure cluster 1 doesn't go in deep idle
state).
Test 3: 1-cfs-8-idle-task:
A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE
tasks are created which run forever (not pinned anywhere, so they run
on all CPUs). Checked with kernelshark that as soon as NORMAL task
sleeps, the SCHED_IDLE task starts running on CPU5.
And here are the results on mean latency (in us), using the "st" tool.
$ st 1-cfs-task/rt-app-cfs_thread-0.log
N min max sum mean stddev
642 90 592 197180 307.134 109.906
$ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
N min max sum mean stddev
642 67 311 113850 177.336 41.4251
$ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
N min max sum mean stddev
643 29 173 41364 64.3297 13.2344
The mean latency when we need to:
- wakeup from deep idle state is 307 us.
- wakeup from shallow idle state is 177 us.
- preempt a SCHED_IDLE task is 64 us.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit left behind an unused variable:
5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation") left behind an unused variable.
kernel/sched/core.c: In function 'set_user_nice':
kernel/sched/core.c:4507:16: warning: variable 'delta' set but not used
int old_prio, delta;
^~~~~
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation")
Link: https://lkml.kernel.org/r/20191219140314.1252-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: a (rare) PSI crash fix, a CPU affinity related balancing
fix, and a toning down of active migration attempts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cfs: fix spurious active migration
sched/fair: Fix find_idlest_group() to handle CPU affinity
psi: Fix a division error in psi poll()
sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
Because of the:
if (!load)
runnable = running = 0;
clause in ___update_load_sum(), all the actual users of @contrib in
accumulate_sum():
if (load)
sa->load_sum += load * contrib;
if (runnable)
sa->runnable_load_sum += runnable * contrib;
if (running)
sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
don't happen, and therefore we don't care what @contrib actually is and
calculating it is pointless.
If we count the times when @load equals zero and not as below:
if (load) {
load_is_not_zero_count++;
contrib = __accumulate_pelt_segments(periods,
1024 - sa->period_contrib,delta);
} else
load_is_zero_count++;
As we can see, load_is_zero_count is much bigger than
load_is_zero_count, and the gap is gradually widening:
load_is_zero_count: 6016044 times
load_is_not_zero_count: 244316 times
19:50:43 up 1 min, 1 user, load average: 0.09, 0.06, 0.02
load_is_zero_count: 7956168 times
load_is_not_zero_count: 261472 times
19:51:42 up 2 min, 1 user, load average: 0.03, 0.05, 0.01
load_is_zero_count: 10199896 times
load_is_not_zero_count: 278364 times
19:52:51 up 3 min, 1 user, load average: 0.06, 0.05, 0.01
load_is_zero_count: 14333700 times
load_is_not_zero_count: 318424 times
19:54:53 up 5 min, 1 user, load average: 0.01, 0.03, 0.00
Perhaps we can gain some performance advantage by saving these
unnecessary calculation.
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot < vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1576208740-35609-1-git-send-email-rocking@linux.alibaba.com
select_idle_cpu() will scan the LLC domain for idle CPUs,
it's always expensive. so the next commit :
1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")
introduces a way to limit how many CPUs we scan.
But it consume some CPUs out of 'nr' that are not allowed
for the task and thus waste our attempts. The function
always return nr_cpumask_bits, and we can't find a CPU
which our task is allowed to run.
Cpumask may be too big, similar to select_idle_core(), use
per_cpu_ptr 'select_idle_mask' to prevent stack overflow.
Fixes: 1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
init_wait_var_entry() forgets to initialize wq_entry->flags.
Currently not a problem, we don't have wait_var_event_exclusive().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20191210191902.GB14449@redhat.com
set_user_nice() implements its own version of fair::prio_changed() and
therefore misses a specific optimization towards nohz_full CPUs that
avoid sending an resched IPI to a reniced task running alone. Use the
proper callback instead.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
The runqueue of a fair task being remotely reniced is going to get a
resched IPI in order to reassess which task should be the current
running on the CPU. However that evaluation is useless if the fair task
is running alone, in which case we can spare that IPI, preventing
nohz_full CPUs from being disturbed.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-2-frederic@kernel.org
The load balance can fail to find a suitable task during the periodic check
because the imbalance is smaller than half of the load of the waiting
tasks. This results in the increase of the number of failed load balance,
which can end up to start an active migration. This active migration is
useless because the current running task is not a better choice than the
waiting ones. In fact, the current task was probably not running but
waiting for the CPU during one of the previous attempts and it had already
not been selected.
When load balance fails too many times to migrate a task, we should relax
the contraint on the maximum load of the tasks that can be migrated
similarly to what is done with cache hotness.
Before the rework, load balance used to set the imbalance to the average
load_per_task in order to mitigate such situation. This increased the
likelihood of migrating a task but also of selecting a larger task than
needed while more appropriate ones were in the list.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1575036287-6052-1-git-send-email-vincent.guittot@linaro.org
Because of CPU affinity, the local group can be skipped which breaks the
assumption that statistics are always collected for local group. With
uninitialized local_sgs, the comparison is meaningless and the behavior
unpredictable. This can even end up to use local pointer which is to
NULL in this case.
If the local group has been skipped because of CPU affinity, we return
the idlest group.
Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: mingo@redhat.com
Cc: mgorman@suse.de
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: bsegall@google.com
Cc: qais.yousef@arm.com
Link: https://lkml.kernel.org/r/1575483700-22153-1-git-send-email-vincent.guittot@linaro.org
The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.
Use div64_u64.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Jingfeng reports rare div0 crashes in psi on systems with some uptime:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330
The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.
We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.
The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.
Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.
Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.
Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.
If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point. Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever. In particular, a system-wide
deadlock may occur during CPU online as a result of that.
The failing scenario is as follows. CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu). It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline. The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3. When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.
To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.
Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.
Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.
Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull thermal management updates from Zhang Rui:
- Fix a deadlock regression in thermal core framework, which was
introduced in 5.3 (Wei Wang)
- Initialize thermal control framework earlier to enable thermal
mitigation during boot (Amit Kucheria)
- Convert the Intelligent Power Allocator (IPA) thermal governor to
follow the generic PM_EM instead of its own Energy Model (Quentin
Perret)
- Introduce a new Amlogic soc thermal driver (Guillaume La Roque)
- Add interrupt support for tsens thermal driver (Amit Kucheria)
- Add support for MSM8956/8976 in tsens thermal driver
(AngeloGioacchino Del Regno)
- Add support for r8a774b1 in rcar thermal driver (Biju Das)
- Add support for Thermal Monitor Unit v2 in qoriq thermal driver
(Yuantian Tang)
- Some other fixes/cleanups on thermal core framework and soc thermal
drivers (Colin Ian King, Daniel Lezcano, Hsin-Yi Wang, Tian Tao)
* 'thermal/next' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (32 commits)
thermal: Fix deadlock in thermal thermal_zone_device_check
thermal: cpu_cooling: Migrate to using the EM framework
thermal: cpu_cooling: Make the power-related code depend on IPA
PM / EM: Declare EM data types unconditionally
arm64: defconfig: Enable CONFIG_ENERGY_MODEL
drivers: thermal: tsens: fix potential integer overflow on multiply
thermal: cpu_cooling: Reorder the header file
thermal: cpu_cooling: Remove pointless dependency on CONFIG_OF
thermal: no need to set .owner when using module_platform_driver
thermal: qcom: tsens-v1: Fix kfree of a non-pointer value
cpufreq: qcom-hw: Move driver initialization earlier
clk: qcom: Initialize clock drivers earlier
cpufreq: Initialize cpufreq-dt driver earlier
cpufreq: Initialize the governors in core_initcall
thermal: Initialize thermal subsystem earlier
thermal: Remove netlink support
dt: thermal: tsens: Document compatible for MSM8976/56
thermal: qcom: tsens-v1: Add support for MSM8956 and MSM8976
MAINTAINERS: add entry for Amlogic Thermal driver
thermal: amlogic: Add thermal driver to support G12 SoCs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL
C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH
F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C
NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay
+KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH
1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e
LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj
tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY
qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U
1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP
xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA
CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo=
=FKC7
-----END PGP SIGNATURE-----
Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull pipe rework from David Howells:
"This is my set of preparatory patches for building a general
notification queue on top of pipes. It makes a number of significant
changes:
- It removes the nr_exclusive argument from __wake_up_sync_key() as
this is always 1. This prepares for the next step:
- Adds wake_up_interruptible_sync_poll_locked() so that poll can be
woken up from a function that's holding the poll waitqueue
spinlock.
- Change the pipe buffer ring to be managed in terms of unbounded
head and tail indices rather than bounded index and length. This
means that reading the pipe only needs to modify one index, not
two.
- A selection of helper functions are provided to query the state of
the pipe buffer, plus a couple to apply updates to the pipe
indices.
- The pipe ring is allowed to have kernel-reserved slots. This allows
many notification messages to be spliced in by the kernel without
allowing userspace to pin too many pages if it writes to the same
pipe.
- Advance the head and tail indices inside the pipe waitqueue lock
and use wake_up_interruptible_sync_poll_locked() to poke poll
without having to take the lock twice.
- Rearrange pipe_write() to preallocate the buffer it is going to
write into and then drop the spinlock. This allows kernel
notifications to then be added the ring whilst it is filling the
buffer it allocated. The read side is stalled because the pipe
mutex is still held.
- Don't wake up readers on a pipe if there was already data in it
when we added more.
- Don't wake up writers on a pipe if the ring wasn't full before we
removed a buffer"
* tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
pipe: Remove sync on wake_ups
pipe: Increase the writer-wakeup threshold to reduce context-switch count
pipe: Check for ring full inside of the spinlock in pipe_write()
pipe: Remove redundant wakeup from pipe_write()
pipe: Rearrange sequence in pipe_write() to preallocate slot
pipe: Conditionalise wakeup in pipe_read()
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
pipe: Allow pipes to have kernel-reserved slots
pipe: Use head and tail pointers for the ring, not cursor and length
Add wake_up_interruptible_sync_poll_locked()
Remove the nr_exclusive argument from __wake_up_sync_key()
pipe: Reduce #inclusion of pipe_fs_i.h
sched_clock_running is enabled early at bootup stage and never
disabled. So hint that to the compiler by using static_branch_likely()
rather than static_branch_unlikely().
The branch probability mis-annotation was introduced in the original
commit that converted the plain sched_clock_running flag to a static key:
46457ea464 ("sched/clock: Use static key for sched_clock_running")
Steve further notes:
| Looks like the confusion was the moving of the "!":
|
| - if (unlikely(!sched_clock_running))
| + if (!static_branch_unlikely(&sched_clock_running))
|
| Where, it was unlikely that !sched_clock_running would be true, but
| because the "!" was moved outside the "unlikely()" it makes the test
| "likely()". That is, if we added an intermediate step, it would have
| been:
|
| if (!likely(sched_clock_running))
|
| which would have prevented the mistake that this patch fixes.
[ mingo: Edited the changelog. ]
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/1574843848-26825-1-git-send-email-zhenzhong.duan@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Use nanoseconds (instead of microseconds) as the unit of time in
the cpuidle core and simplify checks for disabled idle states in
the idle loop (Rafael Wysocki).
- Fix and clean up the teo cpuidle governor (Rafael Wysocki).
- Fix the cpuidle registration error code path (Zhenzhong Duan).
- Avoid excessive vmexits in the ACPI cpuidle driver (Yin Fengwei).
- Extend the idle injection infrastructure to be able to measure the
requested duration in nanoseconds and to allow an exit latency
limit for idle states to be specified (Daniel Lezcano).
- Fix cpufreq driver registration and clarify a comment in the
cpufreq core (Viresh Kumar).
- Add NULL checks to the show() and store() methods of sysfs
attributes exposed by cpufreq (Kai Shen).
- Update cpufreq drivers:
* Fix for a plain int as pointer warning from sparse in
intel_pstate (Jamal Shareef).
* Fix for a hardcoded number of CPUs and stack bloat in the
powernv driver (John Hubbard).
* Updates to the ti-cpufreq driver and DT files to support new
platforms and migrate bindings from opp-v1 to opp-v2 (Adam Ford,
H. Nikolaus Schaller).
* Merging of the arm_big_little and vexpress-spc drivers and
related cleanup (Sudeep Holla).
* Fix for imx's default speed grade value (Anson Huang).
* Minor cleanup of the s3c64xx driver (Nathan Chancellor).
* CPU speed bin detection fix for sun50i (Ondrej Jirman).
- Appoint Chanwoo Choi as the new devfreq maintainer.
- Update the devfreq core:
* Check NULL governor in available_governors_show sysfs to prevent
showing wrong governor information and fix a race condition
between devfreq_update_status() and trans_stat_show() (Leonard
Crestez).
* Add new 'interrupt-driven' flag for devfreq governors to allow
interrupt-driven governors to prevent the devfreq core from
polling devices for status (Dmitry Osipenko).
* Improve an error message in devfreq_add_device() (Matthias
Kaehlcke).
- Update devfreq drivers:
* tegra30 driver fixes and cleanups (Dmitry Osipenko).
* Removal of unused property from dt-binding documentation for
the exynos-bus driver (Kamil Konieczny).
* exynos-ppmu cleanup and DT bindings update (Lukasz Luba, Marek
Szyprowski).
- Add new CPU IDs for CometLake Mobile and Desktop to the Intel RAPL
power capping driver (Zhang Rui).
- Allow device initialization in the generic power domains (genpd)
framework to be more straightforward and clean it up (Ulf Hansson).
- Add support for adjusting OPP voltages at run time to the OPP
framework (Stephen Boyd).
- Avoid freeing memory that has never been allocated in the
hibernation core (Andy Whitcroft).
- Clean up function headers in a header file and coding style in the
wakeup IRQs handling code (Ulf Hansson, Xiaofei Tan).
- Clean up the SmartReflex adaptive voltage scaling (AVS) driver for
ARM (Ben Dooks, Geert Uytterhoeven).
- Wrap power management documentation to fit in 80 columns (Bjorn
Helgaas).
- Add pm-graph utility entry to MAINTAINERS (Todd Brandt).
- Update the cpupower utility:
* Fix the handling of set and info subcommands (Abhishek Goel).
* Fix build warnings (Nathan Chancellor).
* Improve mperf_monitor handling (Janakarajan Natarajan).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3dHGYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxMcgP/1bMSkxlRHFOXYSRwS4YcvkUjlBHrCSi
3qGRyYwhc+eRLqRc+2tcmQeQEeQRBqUt8etp7/9WxqS3nic/3Vdf6AFuhSpmJzo1
6JTEutHMU5eP8lwQuKoUCJncCNdIfEOkd5T35E12W/ar5PwyJio0UByZJBnJBjD/
p7/713ucq6ZH95OGncmCJ1S1UslFCZrSS2RRigDInu8gpEssnwN9zwaJbzUYrZHj
BmnKpBpT8FdLmkpbOtmmiT7q2ZGpUEHhkaO916Knf/+BFdvydTXoR90FVvXKy8Zr
QpOxaTdQB2ADifUa5zs8klVP6otmZhEO9vz8hVMUWGziqagObykQngzl8tqrKEBh
hLI8eEG1IkEBCv5ThQbLcoaRXNpwriXXfvWPTPB8s84HJxNZ09F6pXsv1SLh96qC
lj8Q5Yy2a3tlpsg4LB58XoJ54gOtlh8bWKkM0FytrFI/IP+HT4TUu/Rxgp1nDbGd
tKzLvpn4Yo2h10seeDbYk3l79mogUYj50RmwjjPn+9RwS/Df4eIpNb6ibllGZUN/
zcPZH5xlVfQRl2LKDufVN0nYSnoMZY/fU05p9XbUiJWd80LHYOb4Em1N6h/FNOyl
alDhVwlxEvc2BQwL/gjYmN6Qxc7SsPTBrSGVwjWYY+FghOYQd/wBDQqQUeM21QKg
ChOE3z/F/26r
=GJvT
-----END PGP SIGNATURE-----
Merge tag 'pm-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These include cpuidle changes to use nanoseconds (instead of
microseconds) as the unit of time and to simplify checks for disabled
idle states in the idle loop, some cpuidle fixes and governor updates,
assorted cpufreq updates (driver updates mostly and a few core fixes
and cleanups), devfreq updates (dominated by the tegra30 driver
changes), new CPU IDs for the RAPL power capping driver, relatively
minor updates of the generic power domains (genpd) and operation
performance points (OPP) frameworks, and assorted fixes and cleanups.
There are also two maintainer information updates: Chanwoo Choi will
be maintaining the devfreq subsystem going forward and Todd Brandt is
going to maintain the pm-graph utility (created by him).
Specifics:
- Use nanoseconds (instead of microseconds) as the unit of time in
the cpuidle core and simplify checks for disabled idle states in
the idle loop (Rafael Wysocki)
- Fix and clean up the teo cpuidle governor (Rafael Wysocki)
- Fix the cpuidle registration error code path (Zhenzhong Duan)
- Avoid excessive vmexits in the ACPI cpuidle driver (Yin Fengwei)
- Extend the idle injection infrastructure to be able to measure the
requested duration in nanoseconds and to allow an exit latency
limit for idle states to be specified (Daniel Lezcano)
- Fix cpufreq driver registration and clarify a comment in the
cpufreq core (Viresh Kumar)
- Add NULL checks to the show() and store() methods of sysfs
attributes exposed by cpufreq (Kai Shen)
- Update cpufreq drivers:
* Fix for a plain int as pointer warning from sparse in
intel_pstate (Jamal Shareef)
* Fix for a hardcoded number of CPUs and stack bloat in the
powernv driver (John Hubbard)
* Updates to the ti-cpufreq driver and DT files to support new
platforms and migrate bindings from opp-v1 to opp-v2 (Adam Ford,
H. Nikolaus Schaller)
* Merging of the arm_big_little and vexpress-spc drivers and
related cleanup (Sudeep Holla)
* Fix for imx's default speed grade value (Anson Huang)
* Minor cleanup of the s3c64xx driver (Nathan Chancellor)
* CPU speed bin detection fix for sun50i (Ondrej Jirman)
- Appoint Chanwoo Choi as the new devfreq maintainer.
- Update the devfreq core:
* Check NULL governor in available_governors_show sysfs to prevent
showing wrong governor information and fix a race condition
between devfreq_update_status() and trans_stat_show() (Leonard
Crestez)
* Add new 'interrupt-driven' flag for devfreq governors to allow
interrupt-driven governors to prevent the devfreq core from
polling devices for status (Dmitry Osipenko)
* Improve an error message in devfreq_add_device() (Matthias
Kaehlcke)
- Update devfreq drivers:
* tegra30 driver fixes and cleanups (Dmitry Osipenko)
* Removal of unused property from dt-binding documentation for the
exynos-bus driver (Kamil Konieczny)
* exynos-ppmu cleanup and DT bindings update (Lukasz Luba, Marek
Szyprowski)
- Add new CPU IDs for CometLake Mobile and Desktop to the Intel RAPL
power capping driver (Zhang Rui)
- Allow device initialization in the generic power domains (genpd)
framework to be more straightforward and clean it up (Ulf Hansson)
- Add support for adjusting OPP voltages at run time to the OPP
framework (Stephen Boyd)
- Avoid freeing memory that has never been allocated in the
hibernation core (Andy Whitcroft)
- Clean up function headers in a header file and coding style in the
wakeup IRQs handling code (Ulf Hansson, Xiaofei Tan)
- Clean up the SmartReflex adaptive voltage scaling (AVS) driver for
ARM (Ben Dooks, Geert Uytterhoeven)
- Wrap power management documentation to fit in 80 columns (Bjorn
Helgaas)
- Add pm-graph utility entry to MAINTAINERS (Todd Brandt)
- Update the cpupower utility:
* Fix the handling of set and info subcommands (Abhishek Goel)
* Fix build warnings (Nathan Chancellor)
* Improve mperf_monitor handling (Janakarajan Natarajan)"
* tag 'pm-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
PM: Wrap documentation to fit in 80 columns
cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
cpuidle: Allow idle injection to apply exit latency limit
cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks
cpuidle: teo: Avoid code duplication in conditionals
cpufreq: Register drivers only after CPU devices have been registered
cpuidle: teo: Avoid using "early hits" incorrectly
cpuidle: teo: Exclude cpuidle overhead from computations
PM / Domains: Convert to dev_to_genpd_safe() in genpd_syscore_switch()
mmc: tmio: Avoid boilerplate code in ->runtime_suspend()
PM / Domains: Implement the ->start() callback for genpd
PM / Domains: Introduce dev_pm_domain_start()
ARM: OMAP2+: SmartReflex: add omap_sr_pdata definition
PM / wakeirq: remove unnecessary parentheses
power: avs: smartreflex: Remove superfluous cast in debugfs_create_file() call
cpuidle: Use nanoseconds as the unit of time
PM / OPP: Support adjusting OPP voltages at runtime
PM / core: Clean up some function headers in power.h
cpufreq: Add NULL checks to show() and store() methods of cpufreq
cpufreq: intel_pstate: Fix plain int as pointer warning from sparse
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- A comprehensive rewrite of the robust/PI futex code's exit handling
to fix various exit races. (Thomas Gleixner et al)
- Rework the generic REFCOUNT_FULL implementation using
atomic_fetch_* operations so that the performance impact of the
cmpxchg() loops is mitigated for common refcount operations.
With these performance improvements the generic implementation of
refcount_t should be good enough for everybody - and this got
confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
REFCOUNT_FULL entirely, leaving the generic implementation enabled
unconditionally. (Will Deacon)
- Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
lkdtm: Remove references to CONFIG_REFCOUNT_FULL
locking/refcount: Remove unused 'refcount_error_report()' function
locking/refcount: Consolidate implementations of refcount_t
locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
locking/refcount: Move saturation warnings out of line
locking/refcount: Improve performance of generic REFCOUNT_FULL code
locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
locking/refcount: Remove unused refcount_*_checked() variants
locking/refcount: Ensure integer operands are treated as signed
locking/refcount: Define constants for saturation and max refcount values
futex: Prevent exit livelock
futex: Provide distinct return value when owner is exiting
futex: Add mutex around futex exit
futex: Provide state handling for exec() as well
futex: Sanitize exit state handling
futex: Mark the begin of futex exit explicitly
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Split futex_mm_release() for exit/exec
exit/exec: Seperate mm_release()
futex: Replace PF_EXITPIDONE with a state
...
Pull scheduler updates from Ingo Molnar:
"The biggest changes in this cycle were:
- Make kcpustat vtime aware (Frederic Weisbecker)
- Rework the CFS load_balance() logic (Vincent Guittot)
- Misc cleanups, smaller enhancements, fixes.
The load-balancing rework is the most intrusive change: it replaces
the old heuristics that have become less meaningful after the
introduction of the PELT metrics, with a grounds-up load-balancing
algorithm.
As such it's not really an iterative series, but replaces the old
load-balancing logic with the new one. We hope there are no
performance regressions left - but statistically it's highly probable
that there *is* going to be some workload that is hurting from these
chnages. If so then we'd prefer to have a look at that workload and
fix its scheduling, instead of reverting the changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
rackmeter: Use vtime aware kcpustat accessor
leds: Use all-in-one vtime aware kcpustat accessor
cpufreq: Use vtime aware kcpustat accessors for user time
procfs: Use all-in-one vtime aware kcpustat accessor
sched/vtime: Bring up complete kcpustat accessor
sched/cputime: Support other fields on kcpustat_field()
sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
sched/fair: Add comments for group_type and balancing at SD_NUMA level
sched/fair: Fix rework of find_idlest_group()
sched/uclamp: Fix overzealous type replacement
sched/Kconfig: Fix spelling mistake in user-visible help text
sched/core: Further clarify sched_class::set_next_task()
sched/fair: Use mul_u32_u32()
sched/core: Simplify sched_class::pick_next_task()
sched/core: Optimize pick_next_task()
sched/core: Make pick_next_task_idle() more consistent
sched/fair: Better document newidle_balance()
leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
...
* pm-cpuidle:
cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
cpuidle: Allow idle injection to apply exit latency limit
cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks
cpuidle: teo: Avoid code duplication in conditionals
cpuidle: teo: Avoid using "early hits" incorrectly
cpuidle: teo: Exclude cpuidle overhead from computations
cpuidle: Use nanoseconds as the unit of time
cpuidle: Consolidate disabled state checks
ACPI: processor_idle: Skip dummy wait if kernel is in guest
cpuidle: Do not unset the driver if it is there already
cpuidle: teo: Fix "early hits" handling for disabled idle states
cpuidle: teo: Consider hits and misses metrics of disabled states
cpuidle: teo: Rename local variable in teo_select()
cpuidle: teo: Ignore disabled idle states that are too deep
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
ZfLNkACXqQ==
=YZ9s
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"A lot of stuff has been going on this cycle, with improving the
support for networked IO (and hence unbounded request completion
times) being one of the major themes. There's been a set of fixes done
this week, I'll send those out as well once we're certain we're fully
happy with them.
This contains:
- Unification of the "normal" submit path and the SQPOLL path (Pavel)
- Support for sparse (and bigger) file sets, and updating of those
file sets without needing to unregister/register again.
- Independently sized CQ ring, instead of just making it always 2x
the SQ ring size. This makes it more flexible for networked
applications.
- Support for overflowed CQ ring, never dropping events but providing
backpressure on submits.
- Add support for absolute timeouts, not just relative ones.
- Support for generic cancellations. This divorces io_uring from
workqueues as well, which additionally gets us one step closer to
generic async system call support.
- With cancellations, we can support grabbing the process file table
as well, just like we do mm context. This allows support for system
calls that create file descriptors, like accept4() support that's
built on top of that.
- Support for io_uring tracing (Dmitrii)
- Support for linked timeouts. These abort an operation if it isn't
completed by the time noted in the linke timeout.
- Speedup tracking of poll requests
- Various cleanups making the coder easier to follow (Jackie, Pavel,
Bob, YueHaibing, me)
- Update MAINTAINERS with new io_uring list"
* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: make POLL_ADD/POLL_REMOVE scale better
io-wq: remove now redundant struct io_wq_nulls_list
io_uring: Fix getting file for non-fd opcodes
io_uring: introduce req_need_defer()
io_uring: clean up io_uring_cancel_files()
io-wq: ensure free/busy list browsing see all items
io-wq: ensure we have a stable view of ->cur_work for cancellations
io_wq: add get/put_work handlers to io_wq_create()
io_uring: check for validity of ->rings in teardown
io_uring: fix potential deadlock in io_poll_wake()
io_uring: use correct "is IO worker" helper
io_uring: fix -ENOENT issue with linked timer with short timeout
io_uring: don't do flush cancel under inflight_lock
io_uring: flag SQPOLL busy condition to userspace
io_uring: make ASYNC_CANCEL work with poll and timeout
io_uring: provide fallback request for OOM situations
io_uring: convert accept4() -ERESTARTSYS into -EINTR
io_uring: fix error clear of ->file_table in io_sqe_files_register()
io_uring: separate the io_free_req and io_free_req_find_next interface
io_uring: keep io_put_req only responsible for release and put req
...
Many callsites want to fetch the values of system, user, user_nice, guest
or guest_nice kcpustat fields altogether or at least a pair of these.
In that case calling kcpustat_field() for each requested field brings
unecessary overhead when we could fetch all of them in a row.
So provide kcpustat_cpu_fetch() that fetches the whole kcpustat array
in a vtime safe way under the same RCU and seqcount block.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().
Whether we account the delta to a nice or not nice field is decided on
top of the nice value snapshot taken at the time we call kcpustat_field().
If the nice value of the task has been changed since the last vtime
update, we may have inacurrate distribution of the nice VS unnice
cputime.
However this is considered as a minor issue compared to the proper fix
that would involve interrupting the target on nice updates, which is
undesired on nohz_full CPUs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Modify cpuidle_use_deepest_state() to take an additional exit latency
limit argument to be passed to find_deepest_idle_state() and make
cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for
forced idle.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Rebase and rearrange code, subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In some cases it may be useful to specify an exit latency limit for
the idle state to be used during CPU idle time injection.
Instead of duplicating the information in struct cpuidle_device
or propagating the latency limit in the call stack, replace the
use_deepest_state field with forced_latency_limit_ns to represent
that limit, so that the deepest idle state with exit latency within
that limit is forced (i.e. no governors) when it is set.
A zero exit latency limit for forced idle means to use governors in
the usual way (analogous to use_deepest_state equal to "false" before
this change).
Additionally, add play_idle_precise() taking two arguments, the
duration of forced idle and the idle state exit latency limit, both
in nanoseconds, and redefine play_idle() as a wrapper around that
new function.
This change is preparatory, no functional impact is expected.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.
When a task is attached on a CPU, we have this call path:
update_load_avg()
update_cfs_rq_load_avg()
cfs_rq_util_change -- > trig frequency update
attach_entity_load_avg()
cfs_rq_util_change -- > trig frequency update
The 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.
update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.
It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.
This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.
[ mingo: Minor updates. ]
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>