There may be special requirements on CPU response time, like if a
interrupt is pinned to a CPU, that CPU should not go into excessively
deep idle states. For this reason, add a mechanism for adding
PM QoS resume latency constraints for individual CPUs and modify the
menu governor to take them into account.
To that end, extend the device PM QoS pm_qos_resume_latency attribute
to CPUs, which is possible, because the exit latency for CPUs is
effectively equivalent to the resume latency for devices.
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Acked-by: Rik van Riel <riel@redhat.com>
[ rjw : Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Obsolete commit 71abbbf856 (cpuidle: extend cpuidle and menu governor
to handle dynamic states) wanted to introduce dynamic C-states, but that
idea was dropped long ago. The nonsense deeper C-state checking
remained, though.
Since both target_residency and exit_latency are longer for deeper
idle state, there's no need to waste CPU time on useless checks.
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Acked-by: Rik van Riel <riel@redhat.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The governor's code use try_module_get() and put_module() to refcount
the governor's module. But the governors are not compiled as module.
The refcount does not prevent to switch the governor or unload
a module as they aren't compiled as modules. The code is pointless,
so remove it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable
polling) changed the behavior of the fallback state selection part
of menu_select() so it looks at interactivity_req instead of
data->next_timer_us when it makes its decision. That effectively
caused polling to be used more often as fallback idle which led to
significant increases of energy consumption in some cases.
Commit e132b9b3bc (cpuidle: menu: use high confidence factors
only when considering polling) changed that logic again to be more
predictable, but that didn't help with the increased energy
consumption problem.
For this reason, go back to making decisions on which state to fall
back to based on data->next_timer_us which is the time we know for
sure something will happen rather than a prediction (which may be
inaccurate and turns out to be so often enough to be problematic).
However, take the target residency of the first proper idle state
(C1) into account, so that state is not used as the fallback one
if its target residency is greater than data->next_timer_us.
Fixes: a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable polling)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
The menu governor uses five different factors to pick the
idle state:
- the user configured latency_req
- the time until the next timer (next_timer_us)
- the typical sleep interval, as measured recently
- an estimate of sleep time by dividing next_timer_us by an observed factor
- a load corrected version of the above, divided again by load
Only the first three items are known with enough confidence that
we can use them to consider polling, instead of an actual CPU
idle state, because the cost of being wrong about polling can be
excessive power use.
The latter two are used in the menu governor's main selection
loop, and can result in choosing a shallower idle state when
the system is expected to be busy again soon.
This pushes a busy system in the "performance" direction of
the performance<>power tradeoff, when choosing between idle
states, but stays more strictly on the "power" state when
deciding between polling and C1.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We know that the avg variable actually ends up holding a 32 bit
quantity, since it's an average of such numbers. It is only a u64
because it is temporarily used to hold the sum. Making it an actual
u32 allows gcc to generate slightly better code, e.g. when computing
the square, it can do a 32x32->64 multiply.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).
On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If menu_select() cannot find a suitable state to return, it will
return the state index stored in data->last_state_idx. This
means that it is pointless to look at the states whose indices
are less than or equal to data->last_state_idx in the main loop,
so don't do that.
Given that those checks are done on every idle state selection, this
change can save quite a bit of completely unnecessary overhead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
The menu governor is currently the default on all systems. However the
documentation claims that the ladder governor is preferred on ticking
systems. So bump the rating of the ladder governor when NO_HZ is
disabled, or when booting with nohz=off.
This fixes the first half of kernel BZ #65531.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=65531
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable
polling) exposed a bug in menu_select() causing it to return -1
on systems with CPUIDLE_DRIVER_STATE_START equal to zero, although
it should have returned 0. As a result, idle states are not entered
by CPUs on those systems.
Namely, on the systems in question data->last_state_idx is initially
equal to -1 and the above commit modified the condition that would
have caused it to be changed to 0 to be less likely to trigger which
exposed the problem. However, setting data->last_state_idx initially
to -1 doesn't make sense at all and on the affected systems it should
always be set to CPUIDLE_DRIVER_STATE_START (ie. 0) unconditionally,
so make that happen.
Fixes: a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable polling)
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle state tables contain the maximum exit latency for each
cpuidle state. On x86, that is the exit latency for when the entire
package goes into that same idle state.
However, a lot of the time we only go into the core idle state,
not the package idle state. This means we see a much smaller exit
latency.
We have no way to detect whether we went into the core or package
idle state while idle, and that is ok.
However, the current menu_update logic does have the potential to
trip up the repeating pattern detection in get_typical_interval.
If the system is experiencing an exit latency near the idle state's
exit latency, some of the samples will have exit_us subtracted,
while others will not. This turns a repeating pattern into mush,
potentially breaking get_typical_interval.
Furthermore, for smaller sleep intervals, we know the chance that
all the cores in the package went to the same idle state are fairly
small. Dividing the measured_us by two, instead of subtracting the
full exit latency when hitting a small measured_us, will reduce the
error.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor carefully figures out how much time we typically
sleep for an estimated sleep interval, or whether there is a repeating
pattern going on, and corrects that estimate for the CPU load.
Then it proceeds to ignore that information when determining whether
or not to consider polling. This is not a big deal on most x86 CPUs,
which have very low C1 latencies, and the patch should not have any
effect on those CPUs.
However, certain CPUs (eg. Atom) have much higher C1 latencies, and
it would be good to not waste performance and power on those CPUs if
we are expecting a very low wakeup latency.
Disable polling based on the estimated interactivity requirement, not
on the time to the next timer interrupt.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle menu governor has a forced cut-off for polling at 5us,
in order to deal with firmware that gives the OS bad information
on cpuidle states, leading to the system spending way too much time
in polling.
However, at least one x86 CPU family (Atom) has chips that have
a 20us break-even point for C1. Forcing the polling cut-off to
less than that wastes performance and power.
Increase the polling cut-off to 20us.
Systems with a lower C1 latency will be found in the states table by
the menu governor, which will pick those states as appropriate.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid calling the governor's ->reflect method if the state index
passed to cpuidle_reflect() is negative.
This allows the analogous check to be dropped from menu_reflect(),
so do that too, and ensures that arbitrary error codes can be
passed to cpuidle_reflect() as the index with no adverse
consequences.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Now that the kernel provides DIV_ROUND_CLOSEST_ULL(), drop the internal
implementation and use the kernel one.
Signed-off-by: Javi Merino <javi.merino@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the ladder governor sees the CPUIDLE_FLAG_TIME_INVALID flag,
it unconditionally causes a state promotion by setting last_residency
to a number higher than the state's promotion_time:
last_residency = last_state->threshold.promotion_time + 1
It does this for fear that cpuidle_get_last_residency()
will be in-accurate, because cpuidle_enter_state() invoked
a state with CPUIDLE_FLAG_TIME_INVALID.
But the only state with CPUIDLE_FLAG_TIME_INVALID is
acpi_safe_halt(), which may return well after its actual
idle duration because it enables interrupts, so cpuidle_enter_state()
also measures interrupt service time.
So what? In ladder, a huge invalid last_residency has exactly
the same effect as the current code -- it unconditionally
causes a state promotion.
In the case where the idle residency plus measured interrupt
handling time is less than the state's demotion_time -- we should
use that timestamp to give ladder a chance to demote, rather than
unconditionally promoting.
This can be done by simply ignoring the CPUIDLE_FLAG_TIME_INVALID,
and using the "invalid" time, as it is either equal to what we are
doing today, or better.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When menu sees CPUIDLE_FLAG_TIME_INVALID, it ignores its timestamps,
and assumes that idle lasted as long as the time till next predicted
timer expiration.
But if an interrupt was seen and serviced before that duration,
it would actually be more accurate to use the measured time
rather than rounding up to the next predicted timer expiration.
And if an interrupt is seen and serviced such that the mesured time
exceeds the time till next predicted timer expiration, then
truncating to that expiration is the right thing to do --
since we can never stay idle past that timer expiration.
So the code can do a better job without
checking for CPUIDLE_FLAG_TIME_INVALID.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The only place where the time is invalid is when the ACPI_CSTATE_FFH entry
method is not set. Otherwise for all the drivers, the time can be correctly
measured.
Instead of duplicating the CPUIDLE_FLAG_TIME_VALID flag in all the drivers
for all the states, just invert the logic by replacing it by the flag
CPUIDLE_FLAG_TIME_INVALID, hence we can set this flag only for the acpi idle
driver, remove the former flag from all the drivers and invert the logic with
this flag in the different governor.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All of these are for address calculation. Replace with
this_cpu_ptr().
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
[cpufreq changes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix for an ACPI-based device hotplug regression introduced in 3.14
that causes a kernel panic to trigger when memory hot-remove is
attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen.
- Fix for a cpufreq regression introduced in 3.16 that triggers a
"sleeping function called from invalid context" bug in
dev_pm_opp_init_cpufreq_table() from Stephen Boyd.
- ACPI battery driver fix for a warning message added in 3.16 that
prints silly stuff sometimes from Mariusz Ceier.
- Hibernation fix for safer handling of mismatches in the 820 memory
map between the configurations during image creation and during
the subsequent restore from Chun-Yi Lee.
- ACPI processor driver fix to handle CPU hotplug notifications
correctly during system suspend/resume from Lan Tianyu.
- Series of four cpuidle menu governor cleanups that also should
speed it up a bit from Mel Gorman.
- Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus Pargmann
and Uwe Kleine-König.
- Version 3.0 of the analyze_suspend.py suspend profiling tool
from Todd E Brandt.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJT7UnNAAoJEILEb/54YlRxcxIP/ROFeak3+5tt3hkvZCevxpUh
AMPccgUoqsF2dognO3pcR4AgGP+meM6Qw0zBjPDNx6oo87hw7P1HlngfaRPHnWPh
iAkY2p1QhGAZW29vqxqBIdLVP9M+Nje0tvOX8/6QEsQgo2y6YCbJU0zITmvb8Tsk
183cXiz6xXDezt4sPeIVg2QVfngVFtOeNVgHDIhldQSF6zUQJP/3+BVutvaj3olt
2O3qpNfwJjFh9p6LWQ+CAalq3hJyNZ6ettLNCvudeq4kqRo49WAdjHaRW+qju/NR
dWybO29MfviczABVQ1ReqSnz0MJOqhZNxkEi5KqnYBb3fx8e2XffsBFzFzTp6BJi
bp4ALcFIu9r5ctWVxQhmgEC6uhYMIXZ681sH99HyIdzk2cNRgMxRj6u2aVe/Cczu
Bb489CRHmOrZyXrkmENg+LkOYBNoXcT+RepH9Ex8R+TNBlKLEBKMMgPrfbFeVKWB
Vm621tHNATJG8nJcs3zJulM2FQ0q8c2irw6WwhUxzbSOxmqSvO5zN3OgYt+c+gWk
MmA8IhUpQBLkqBx1FMi0lOOdIW3qKZJFrU39VQEjoP4P1nXgf373NPlfgzMvEvqM
qQ8srMKFUjYxH3g0ftWk5a2MwEjyHQpvZe0djsMCN7ZkFLwUe1ri/R9Ja2LLQcIZ
SyVkFbbO+moXTRMA1yA9
=kpiw
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are a couple of regression fixes, cpuidle menu governor
optimizations, fixes for ACPI proccessor and battery drivers,
hibernation fix to avoid problems related to the e820 memory map,
fixes for a few cpufreq drivers and a new version of the suspend
profiling tool analyze_suspend.py.
Specifics:
- Fix for an ACPI-based device hotplug regression introduced in 3.14
that causes a kernel panic to trigger when memory hot-remove is
attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen
- Fix for a cpufreq regression introduced in 3.16 that triggers a
"sleeping function called from invalid context" bug in
dev_pm_opp_init_cpufreq_table() from Stephen Boyd
- ACPI battery driver fix for a warning message added in 3.16 that
prints silly stuff sometimes from Mariusz Ceier
- Hibernation fix for safer handling of mismatches in the 820 memory
map between the configurations during image creation and during the
subsequent restore from Chun-Yi Lee
- ACPI processor driver fix to handle CPU hotplug notifications
correctly during system suspend/resume from Lan Tianyu
- Series of four cpuidle menu governor cleanups that also should
speed it up a bit from Mel Gorman
- Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus
Pargmann and Uwe Kleine-König
- Version 3.0 of the analyze_suspend.py suspend profiling tool from
Todd E Brandt"
* tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / battery: Fix warning message in acpi_battery_get_state()
PM / tools: analyze_suspend.py: update to v3.0
cpufreq: arm_big_little: fix module license spec
cpufreq: speedstep-smi: fix decimal printf specifiers
ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove()
cpufreq: OPP: Avoid sleeping while atomic
cpufreq: cpu0: Do not print error message when deferring
cpufreq: integrator: Use set_cpus_allowed_ptr
PM / hibernate: avoid unsafe pages in e820 reserved regions
ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events
cpuidle: menu: Lookup CPU runqueues less
cpuidle: menu: Call nr_iowait_cpu less times
cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
cpuidle: menu: Use shifts when calculating averages where possible
Pull trivial tree changes from Jiri Kosina:
"Summer edition of trivial tree updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: fix two typos in watchdog-api.txt
irq-gic: remove file name from heading comment
MAINTAINERS: Add miscdevice.h to file list for char/misc drivers.
scsi: mvsas: mv_sas.c: Fix for possible null pointer dereference
doc: replace "practise" with "practice" in Documentation
befs: remove check for CONFIG_BEFS_RW
scsi: doc: fix 'SCSI_NCR_SETUP_MASTER_PARITY'
drivers/usb/phy/phy.c: remove a leading space
mfd: fix comment
cpuidle: fix comment
doc: hpfall.c: fix missing null-terminate after strncpy call
usb: doc: hotplug.txt code typos
kbuild: fix comment in Makefile.modinst
SH: add proper prompt to SH_MAGIC_PANEL_R2_VERSION
ARM: msm: Remove MSM_SCM
crypto: Remove MPILIB_EXTRA
doc: CN: remove dead link, kerneltrap.org no longer works
media: update reference, kerneltrap.org no longer works
hexagon: update reference, kerneltrap.org no longer works
doc: LSM: update reference, kerneltrap.org no longer works
...
The menu governer makes separate lookups of the CPU runqueue to get
load and number of IO waiters but it can be done with a single lookup.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
menu_select() via inline functions calls nr_iowait_cpu() twice as much
as necessary.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ktime_to_us implementation is slightly better than the one implemented
in menu.c. Use it
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We use do_div even though the divisor will usually be a power-of-two
unless there are unusual outliers. Use shifts where possible
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
use CPUIDLE_DRIVER_STATE_START, instead of hardcoded value 0. As,
CPUIDLE_DRIVER_STATE_START can be 1/0 based on
CONFIG_ARCH_HAS_CPU_RELAX is defined or not.
Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
STDDEV_THRESH was once defined and used in menu governor. But now its no longer
used anywhere. So removing the define.
Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In menu_select function we check for correction factor every time.
If it is zero we are initializing to unity. Hence move it to init function
and initialise by unity, hence avoid repeated comparisons.
Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org>
Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If there is a PM QoS latency limit and all of the sufficiently shallow
C-states are disabled, the cpuidle menu governor returns 0 which on
some systems is CPUIDLE_DRIVER_STATE_START and shouldn't be returned
if that C-state has been disabled.
Fix the issue by modifying the menu governor to return (-1) in such
situations.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor performance multiplier defines a minimum predicted
idle duration to latency ratio. Instead of checking this separately
in every iteration of the state selection loop, adjust the overall
latency restriction for the whole loop if this restriction is tighter
than what is set by the QoS subsystem.
The original test
s->exit_latency * multiplier > data->predicted_us
becomes
s->exit_latency > data->predicted_us / multiplier
by dividing both sides of the comparison by "multiplier".
While division is likely to be several times slower than multiplication,
the minor performance hit allows making a generic sleep state selection
function based on (sleep duration, maximum latency) tuple.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor statistics update function tries to determine the
amount of time between entry to low power state and the occurrence
of the wakeup event. However, the time measured by the framework
includes exit latency on top of the desired value. This exit latency
is substracted from the measured value to obtain the desired value.
When measured value is not available, the menu governor assumes
the wakeup was caused by the timer and the time is equal to remaining
timer length. No exit latency should be substracted from this value.
This patch prevents the erroneous substraction and clarifies the
associated comment. It also removes one intermediate variable that
serves no purpose.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor uses coefficients as one method of actual idle
period length estimation. The coefficients are, as detailed below,
multipliers giving expected idle period length from time until next
timer expiry. The multipliers are supposed to have domain of (0..1].
The coefficients are fractions where only the numerators are stored
and denominators are a shared constant RESOLUTION*DECAY. Since the
value of the coefficient should always be greater than 0 and less
than or equal to 1, the numerator must have a value greater than
0 and less than or equal to RESOLUTION*DECAY.
If the coefficients are updated with measured idle durations exceeding
timer length, the multiplier may reach values exceeding unity (i.e.
the stored numerator exceeds RESOLUTION*DECAY). This patch ensures that
the multipliers are updated with durations capped to timer length.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently menu governor records the exit latency of the state it has
chosen for the idle period. The stored latency value is then later
used to calculate the actual length of the idle period. This value
may however be incorrect, as the entered state may not be the one
chosen by the governor. The entered state information is available,
so we can use that to obtain the real exit latency.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The field expected_us is used to store the time remaining until next
timer expiry. The name is inaccurate, as we really do not expect all
wakeups to be caused by timers. In addition, another field with a very
similar name (predicted_us) is used to store the predicted time
remaining until any wakeup source being active.
This patch renames expected_us to next_timer_us in order to better
reflect the contained information.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Field predicted_us value can never exceed expected_us value, but it has
a potentially larger type. As there is no need for additional 32 bits of
zeroes on 32 bit plaforms, change the type of predicted_us to match the
type of expected_us.
Field correction_factor is used to store a value that cannot exceed the
product of RESOLUTION and DECAY (default 1024*8 = 8192). The constants
cannot in practice be incremented to such values, that they'd overflow
unsigned int even on 32 bit systems, so the type is changed to avoid
unnecessary 64 bit arithmetic on 32 bit systems.
One multiplication of (now) 32 bit values needs an added cast to avoid
truncation of the result and has been added.
In order to avoid another multiplication from 32 bit domain to 64 bit
domain, the new correction_factor calculation has been changed from
new = old * (DECAY-1) / DECAY
to
new = old - old / DECAY,
which with infinite precision would yeild exactly the same result, but
now changes the direction of rounding. The impact is not significant as
the maximum accumulated difference cannot exceed the value of DECAY,
which is relatively small compared to product of RESOLUTION and DECAY
(8 / 8192).
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor has a number of tunable constants that may be changed
in the source. If certain combination of values are chosen, an overflow
is possible when the correction_factor is being recalculated.
This patch adds a warning regarding this possibility and describes the
change needed for fixing the issue. The change should not be permanently
enabled, as it will hurt performance when it is not needed.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor uses a static function get_typical_interval() to
try to detect a repeating pattern of wakeups. The previous interval
durations are stored as an array of unsigned ints, but the arithmetic
in the function is performed exclusively as 64 bit values, even when
the value stored in a variable is known not to exceed unsigned int,
which may be smaller and more efficient on some platforms.
This patch changes the types of varibles used to store some
intermediates, the maximum and and the cutoff threshold to unsigned
ints. Average and standard deviation are still treated as 64 bit values,
even when the values are known to be within the domain of unsigned int,
to avoid casts to ensure correct integer promotion for arithmetic
operations.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Struct menu_device member intervals is declared as u32, but the value
stored is (unsigned) int. The type is changed to match the value being
stored.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function get_typical_interval() initializes a number of variables
that are immediately after declarations assigned constant values.
In addition, there are multiple assignments on a single line, which
is explicitly forbidden by Documentation/CodingStyle.
This patch removes redundant initial values for the variables and
breaks up the multiple assignment line.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
get_typical_interval() uses int_sqrt() in calculation of standard
deviation. The formal parameter of int_sqrt() is unsigned long, which
may on some platforms be smaller than the 64 bit unsigned integer used
as the actual parameter. The overflow can occur frequently when actual
idle period lengths are in hundreds of milliseconds.
This patch adds a check for such overflow and rejects the candidate
average when an overflow would occur.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch rearranges a if-return-elsif-goto-fi-return sequence into
if-return-fi-if-return-fi-goto sequence. The functionality remains the
same. Also, a lengthy comment that did not describe the functionality
in the order it occurs is split into half and top half is moved closer
to actual implementation it describes.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch prevents cpuidle menu governor from using repeating interval
prediction result if the idle period predicted is longer than the one
allowed by shortest running timer.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in
general case), since it depends on commit 69a37be (cpuidle: Quickly
notice prediction failure for repeat mode) that has been identified
as the source of a significant performance regression in v3.8 and
later.
Requested-by: Jeremy Eder <jeder@redhat.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq governors are defined as modules in the code, but the Kconfig
options do not allow them to be built as modules. This is not really
a problem, but the cpuidle init ordering is: the cpuidle init
functions (framework and driver) and then the governors. That leads
to some weirdness in the cpuidle framework.
Namely, cpuidle_register_device() calls cpuidle_enable_device() which
fails at the first attempt, because governors have not been registered
yet. When a governor is registered, the framework calls
cpuidle_enable_device() again which runs __cpuidle_register_device()
only then. Of course, for that to work, the cpuidle_enable_device()
return value has to be ignored by cpuidle_register_device().
Instead of having this cyclic call graph and relying on a positive
side effects of the hackish back and forth cpuidle_enable_device()
calls it is better to fix the cpuidle init ordering.
To that end, replace the module init code with postcore_initcall()
so we have:
* cpuidle framework : core_initcall
* cpuidle governors : postcore_initcall
* cpuidle drivers : device_initcall
and remove the corresponding module exit code as it is dead anyway
(governors can't be built as modules).
[rjw: Changelog]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We realized that the power usage field is never filled and when it
is filled for tegra, the power_specified flag is not set causing all
of these values to be reset when the driver is initialized with
set_power_state().
However, the power_specified flag can be simply removed under the
assumption that the states are always backward sorted, which is the
case with the current code.
This change allows the menu governor select function and the
cpuidle_play_dead() to be simplified. Moreover, the
set_power_states() function can removed as it does not make sense
any more.
Drop the power_specified flag from struct cpuidle_driver and make
the related changes as described above.
As a consequence, this also fixes the bug where on the dynamic
C-states system, the power fields are not initialized.
[rjw: Changelog]
References: https://bugzilla.kernel.org/show_bug.cgi?id=42870
References: https://bugzilla.kernel.org/show_bug.cgi?id=43349
References: https://lkml.org/lkml/2012/10/16/518
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead
of -1) to init the local copies so that functions that tries to find
cpuidle states with minimum power usage works correctly even if they use
non-negative values.
Signed-off-by: Sivaram Nair <sivaramn@nvidia.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).
Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.
This should do something sane with a sleep interval series like:
200 180 210 10000 30 1000 170 200
The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.
The original patch come from Rik van Riel <riel@redhat.com>.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
The patch extends to general case that prediction logic get a small predicted
residency, so it choose a shallow C-state though the expected residency is large
. Once the prediction will be fail, the CPU will keep staying at shallow C-state
for a long time. Acutally, the CPU has change enter into deep C-state.
So when the expected residency is long enough but governor choose a shallow
C-state, an timer will be added in order to monitor if the prediction failure.
When C-state is waken up prior to the adding timer, the timer will be cancelled
initiatively. When the timer is triggered and menu governor will quickly notice
prediction failure and re-evaluates deeper C-states possibility.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.
There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
governor will predict it is repeat mode and there is another IPI wake up idle
CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.
In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.
Below is another case which will clearly show the patch much benefit:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>
volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;
void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}
void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;
if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}
}
static void sighand(int sig)
{
*shutdown = 1;
}
int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];
static char optstr[] = "n:l:t:h:";
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}
printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);
for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);
for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);
exit(0);
}
Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop
We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.
While after patched the kernel, we find that deep C-state will keep >99.6%.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
For the mechanism introduced by commit cbc9ef0 (PM / Domains: Add
preliminary support for cpuidle, v2) to work with the ladder
governor, that governor should respect the "disabled" state flag
added by that commit. Change the ladder governor accordingly.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There are two cpuidle governors ladder and menu. While the ladder
governor is always available, if CONFIG_CPU_IDLE is selected, the
menu governor additionally requires CONFIG_NO_HZ.
A particular C state can be disabled by writing to the sysfs file
/sys/devices/system/cpu/cpuN/cpuidle/stateN/disable, but this mechanism
is only implemented in the menu governor. Thus, in a system where
CONFIG_NO_HZ is not selected, the ladder governor becomes default and
always will walk through all sleep states - irrespective of whether the
C state was disabled via sysfs or not. The only way to select a specific
C state was to write the related latency to /dev/cpu_dma_latency and
keep the file open as long as this setting was required - not very
practical and not suitable for setting a single core in an SMP system.
With this patch, the ladder governor only will promote to the next
C state, if it has not been disabled, and it will demote, if the
current C state was disabled.
Note that the patch does not make the setting of the sysfs variable
"disable" coherent, i.e. if one is disabling a light state, then all
deeper states are disabled as well, but the "disable" variable does not
reflect it. Likewise, if one enables a deep state but a lighter state
still is disabled, then this has no effect. A related section has been
added to the documentation.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
On some systems there are CPU cores located in the same power
domains as I/O devices. Then, power can only be removed from the
domain if all I/O devices in it are not in use and the CPU core
is idle. Add preliminary support for that to the generic PM domains
framework.
First, the platform is expected to provide a cpuidle driver with one
extra state designated for use with the generic PM domains code.
This state should be initially disabled and its exit_latency value
should be set to whatever time is needed to bring up the CPU core
itself after restoring power to it, not including the domain's
power on latency. Its .enter() callback should point to a procedure
that will remove power from the domain containing the CPU core at
the end of the CPU power transition.
The remaining characteristics of the extra cpuidle state, referred to
as the "domain" cpuidle state below, (e.g. power usage, target
residency) should be populated in accordance with the properties of
the hardware.
Next, the platform should execute genpd_attach_cpuidle() on the PM
domain containing the CPU core. That will cause the generic PM
domains framework to treat that domain in a special way such that:
* When all devices in the domain have been suspended and it is about
to be turned off, the states of the devices will be saved, but
power will not be removed from the domain. Instead, the "domain"
cpuidle state will be enabled so that power can be removed from
the domain when the CPU core is idle and the state has been chosen
as the target by the cpuidle governor.
* When the first I/O device in the domain is resumed and
__pm_genpd_poweron(() is called for the first time after
power has been removed from the domain, the "domain" cpuidle
state will be disabled to avoid subsequent surprise power removals
via cpuidle.
The effective exit_latency value of the "domain" cpuidle state
depends on the time needed to bring up the CPU core itself after
restoring power to it as well as on the power on latency of the
domain containing the CPU core. Thus the "domain" cpuidle state's
exit_latency has to be recomputed every time the domain's power on
latency is updated, which may happen every time power is restored
to the domain, if the measured power on latency is greater than
the latency stored in the corresponding generic_pm_domain structure.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Andrew J.Schorr raises a question. When he changes the disable setting on
a single CPU, it affects all the other CPUs. Basically, currently, the
disable field is per-driver instead of per-cpu. All the C states of the
same driver are shared by all CPU in the same machine.
The patch changes the `disable' field to per-cpu, so we could set this
separately for each cpu.
Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
Reported-by: Andrew J.Schorr <aschorr@telemetry-investments.com>
Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
power_usage is always assigned a negative value and should be declared
a signed integer
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Some C states of new CPU might be not good. One reason is BIOS might
configure them incorrectly. To help developers root cause it quickly, the
patch adds a new sysfs entry, so developers could disable specific C state
manually.
In addition, C state might have much impact on performance tuning, as it
takes much time to enter/exit C states, which might delay interrupt
processing. With the new debug option, developers could check if a deep C
state could impact performance and how much impact it could cause.
Also add this option in Documentation/cpuidle/sysfs.txt.
[akpm@linux-foundation.org: check kstrtol return value]
Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com>
Reviewed-and-Tested-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
cpuidle: Single/Global registration of idle states
cpuidle: Split cpuidle_state structure and move per-cpu statistics fields
cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()
cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state
ACPI: Fix CONFIG_ACPI_DOCK=n compiler warning
ACPI: Export FADT pm_profile integer value to userspace
thermal: Prevent polling from happening during system suspend
ACPI: Drop ACPI_NO_HARDWARE_INIT
ACPI atomicio: Convert width in bits to bytes in __acpi_ioremap_fast()
PNPACPI: Simplify disabled resource registration
ACPI: Fix possible recursive locking in hwregs.c
ACPI: use kstrdup()
mrst pmu: update comment
tools/power turbostat: less verbose debugging
This patch makes the cpuidle_states structure global (single copy)
instead of per-cpu. The statistics needed on per-cpu basis
by the governor are kept per-cpu. This simplifies the cpuidle
subsystem as state registration is done by single cpu only.
Having single copy of cpuidle_states saves memory. Rare case
of asymmetric C-states can be handled within the cpuidle driver
and architectures such as POWER do not have asymmetric C-states.
Having single/global registration of all the idle states,
dynamic C-state transitions on x86 are handled by
the boot cpu. Here, the boot cpu would disable all the devices,
re-populate the states and later enable all the devices,
irrespective of the cpu that would receive the notification first.
Reference:
https://lkml.org/lkml/2011/4/25/83
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The cpuidle_device->prepare() mechanism causes updates to the
cpuidle_state[].flags, setting and clearing CPUIDLE_FLAG_IGNORE
to tell the governor not to chose a state on a per-cpu basis at
run-time. State demotion is now handled by the driver and it returns
the actual state entered. Hence, this mechanism is not required.
Also this removes per-cpu flags from cpuidle_state enabling
it to be made global.
Reference:
https://lkml.org/lkml/2011/3/25/52
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cpuidle governor only suggests the state to enter using the
governor->select() interface, but allows the low level driver to
override the recommended state. The actual entered state
may be different because of software or hardware demotion. Software
demotion is done by the back-end cpuidle driver and can be accounted
correctly. Current cpuidle code uses last_state field to capture the
actual state entered and based on that updates the statistics for the
state entered.
Ideally the driver enter routine should update the counters,
and it should return the state actually entered rather than the time
spent there. The generic cpuidle code should simply handle where
the counters live in the sysfs namespace, not updating the counters.
Reference:
https://lkml.org/lkml/2011/3/25/52
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This file has module_init/exit and MODULE_LICENSE, and so it
needs the full module.h header.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The PM QoS implementation files are better named
kernel/power/qos.c and include/linux/pm_qos.h.
The PM QoS support is compiled under the CONFIG_PM option.
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Acked-by: markgross <markgross@thegnar.org>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cpuidle menu governor is using u32 as a temporary datatype for storing
nanosecond values which wrap around at 4.294 seconds. This causes errors
in predicted sleep times resulting in higher than should be C state
selection and increased power consumption. This also breaks cpuidle
state residency statistics.
cc: stable@kernel.org # .32.x through .39.x
Signed-off-by: Tero Kristo <tero.kristo@nokia.com>
Signed-off-by: Len Brown <len.brown@intel.com>
On some SoC chips, HW resources may be in use during any particular idle
period. As a consequence, the cpuidle states that the SoC is safe to
enter can change from idle period to idle period. In addition, the
latency and threshold of each cpuidle state can vary, depending on the
operating condition when the CPU becomes idle, e.g. the current cpu
frequency, the current state of the HW blocks, etc.
cpuidle core and the menu governor, in the current form, are geared
towards cpuidle states that are static, i.e. the availabiltiy of the
states, their latencies, their thresholds are non-changing during run
time. cpuidle does not provide any hook that cpuidle drivers can use to
adjust those values on the fly for the current idle period before the menu
governor selects the target cpuidle state.
This patch extends cpuidle core and the menu governor to handle states
that are dynamic. There are three additions in the patch and the patch
maintains backwards-compatibility with existing cpuidle drivers.
1) add prepare() to struct cpuidle_device. A cpuidle driver can hook
into the callback and cpuidle will call prepare() before calling the
governor's select function. The callback gives the cpuidle driver a
chance to update the dynamic information of the cpuidle states for the
current idle period, e.g. state availability, latencies, thresholds,
power values, etc.
2) add CPUIDLE_FLAG_IGNORE as one of the state flags. In the prepare()
function, a cpuidle driver can set/clear the flag to indicate to the
menu governor whether a cpuidle state should be ignored, i.e. not
available, during the current idle period.
3) add power_specified bit to struct cpuidle_device. The menu governor
currently assumes that the cpuidle states are arranged in the order of
increasing latency, threshold, and power savings. This is true or can
be made true for static states. Once the state parameters are dynamic,
the latencies, thresholds, and power savings for the cpuidle states can
increase or decrease by different amounts from idle period to idle
period. So the assumption of increasing latency, threshold, and power
savings from Cn to C(n+1) can no longer be guaranteed.
It can be straightforward to calculate the power consumption of each
available state and to specify it in power_usage for the idle period.
Using the power_usage fields, the menu governor then selects the state
that has the lowest power consumption and that still satisfies all other
critieria. The power_specified bit defaults to 0. For existing cpuidle
drivers, cpuidle detects that power_specified is 0 and fills in a dummy
set of power_usage values.
Signed-off-by: Ai Li <aili@codeaurora.org>
Cc: Len Brown <len.brown@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us())
broke things by not making sure preemption was indeed disabled
by the callers of nr_iowait_cpu() which took the iowait value of
the current cpu.
This resulted in a heap of preempt warnings. Cure this by making
nr_iowait_cpu() take a cpu number and fix up the callers to pass
in the right number.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: linux-pm@lists.linux-foundation.org
LKML-Reference: <1277968037.1868.120.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, the menu governor uses the (corrected) next timer as key item
for predicting the idle duration.
It turns out that there are specific cases where this breaks down: There
are cases where we have a very repetitive pattern of idle durations, where
the idle period is pretty much the same, for reasons completely unrelated
to the next timer event. Examples of such repeating patterns are network
loads with irq mitigation, the mouse moving but in theory also the wifi
beacons.
This patch adds a relatively simple detector for such repeating patterns,
where the standard deviation of the last 8 idle periods is compared to a
threshold.
With this extra predictor in place, measurements show that the DECAY
factor can now be increased (the decaying average will now decay slower)
to get an even more stable result.
[arjan@infradead.org: fix bug identified by Frank]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the string based list management to a handle base
implementation to help with the hot path use of pm-qos, it also renames
much of the API to use "request" as opposed to "requirement" that was
used in the initial implementation. I did this because request more
accurately represents what it actually does.
Also, I added a string based ABI for users wanting to use a string
interface. So if the user writes 0xDDDDDDDD formatted hex it will be
accepted by the interface. (someone asked me for it and I don't think
it hurts anything.)
This patch updates some documentation input I got from Randy.
Signed-off-by: markgross <mgross@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
commit 672917dcc7 ("cpuidle: menu governor: reduce latency on exit")
added an optimization, where the analysis on the past idle period moved
from the end of idle, to the beginning of the new idle.
Unfortunately, this optimization had a bug where it zeroed one key
variable for new use, that is needed for the analysis. The fix is
simple, zero the variable after doing the work from the previous idle.
During the audit of the code that found this issue, another issue was
also found; the ->measured_us data structure member is never set, a
local variable is always used instead.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reorder struct menu_device to remove 8 bytes of padding on 64 bit builds.
Size drops from 136 to 128 bytes, so possibly needing one fewer cache
lines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
menu: use proper 64 bit math
The new menu governor is incorrectly doing a 64 bit divide. Compile
tested only
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It does not seem possible that ldev can be NULL, so drop the unnecessary
test. If ldev can somehow be NULL, then the initialization of last_idx
should be moved below the test.
A simplified version of the semantic match that detects this problem is as
follows (http://coccinelle.lip6.fr/):
// <smpl>
@match exists@
expression x, E;
identifier fld;
@@
* x->fld
... when != \(x = E\|&x\)
* x == NULL
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the state residency accounting and statistics computation off the hot
exit path.
On exit, the need to recompute statistics is recorded, and new statistics
will be computed when menu_select is called again.
The expected effect is to reduce processor wakeup latency from sleep
(C-states). We are speaking of few hundreds of cycles reduction out of a
several microseconds latency (determined by the hardware transition), so
it is difficult to measure.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Adam Belay <abelay@novell.com
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the menu idle governor which balances power savings, energy efficiency
and performance impact.
The reason for a reworked governor is that there have been serious
performance issues reported with the existing code on Nehalem server
systems.
To show this I'm sure Andrew wants to see benchmark results:
(benchmark is "fio", "no cstates" is using "idle=poll")
no cstates current linux new algorithm
1 disk 107 Mb/s 85 Mb/s 105 Mb/s
2 disks 215 Mb/s 123 Mb/s 209 Mb/s
12 disks 590 Mb/s 320 Mb/s 585 Mb/s
In various power benchmark measurements, no degredation was found by our
measurement&diagnostics team. Obviously a small percentage more power was
used in the "fio" benchmark, due to the much higher performance.
While it would be a novel idea to describe the new algorithm in this
commit message, I cheaped out and described it in comments in the code
instead.
[changes since first post: spelling fixes from akpm, review feedback,
folded menu-tng into menu.c]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add decaying history of predicted idle time, instead of using the last early
wakeup. This logic helps menu governor do better job of predicting idle time.
With this change, we also measured noticable (~8%) power savings on
a DP server system with CPUs supporting deep C states, when system
was lightly loaded. There was no change to power or perf on other load
conditions.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
ladder governor only honored latency requirement when promoting C-states.
Instead. it should check for latency requirement on each idle call,
and demote to appropriate C-state when there is a latency requirement change.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
There is a bug in menu governor where we have
if (data->elapsed_us < data->elapsed_us + measured_us)
with measured_us already having elapsed_us added in tickless case here
unsigned int measured_us =
cpuidle_get_last_residency(dev) + data->elapsed_us;
Also, it should be last_residency, not measured_us, that need to be used to
do comparing and distinguish between expected & non-expected events.
Refactor menu_reflect() to fix these two problems.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
poll_idle was added to CPUIDLE, just as a low latency idle handler, to be
used in cases when user desires CPUs not to enter any idle state at all. It
was supposed to be a run time idle=poll option to the user. But, it was indeed
getting used during normal menu and ladder governor default case, with no
special user setting (Reported by Linus Torvalds).
Change below ensures that poll_idle will not be used unless user explicitly
asks pm_qos infrastructure for zero latency requirement.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
The following patch is a generalization of the latency.c implementation done
by Arjan last year. It provides infrastructure for more than one parameter,
and exposes a user mode interface for processes to register pm_qos
expectations of processes.
This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.
Currently we have {cpu_dma_latency, network_latency, network_throughput} as
the initial set of pm_qos parameters.
The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implement is defined by pm_qos_power_init()
and pm_qos_params.h. This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.
For each parameter a list of performance requirements is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the requirement list or elements of the list. Typically the
aggregated target value is simply the max or min of the requirement values
held in the parameter list elements.
>From kernel mode the use of this interface is simple:
pm_qos_add_requirement(param_id, name, target_value):
Will insert a named element in the list for that identified PM_QOS
parameter with the target value. Upon change to this list the new target is
recomputed and any registered notifiers are called only if the target value
is now different.
pm_qos_update_requirement(param_id, name, new_target_value):
Will search the list identified by the param_id for the named list element
and then update its target value, calling the notification tree if the
aggregated target is changed. with that name is already registered.
pm_qos_remove_requirement(param_id, name):
Will search the identified list for the named element and remove it, after
removal it will update the aggregate target and call the notification tree
if the target was changed as a result of removing the named requirement.
>From user mode:
Only processes can register a pm_qos requirement. To provide for
automatic cleanup for process the interface requires the process to register
its parameter requirements in the following way:
To register the default pm_qos target for the specific parameter, the
process must open one of /dev/[cpu_dma_latency, network_latency,
network_throughput]
As long as the device node is held open that process has a registered
requirement on the parameter. The name of the requirement is
"process_<PID>" derived from the current->pid from within the open system
call.
To change the requested target value the process needs to write a s32
value to the open device node. This translates to a
pm_qos_update_requirement call.
To remove the user mode request for a target value simply close the device
node.
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix build again]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: mark gross <mgross@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit e5a16b1f9eec0af7cfa0830304b41c1c0833cf9f
Author: Len Brown <len.brown@intel.com>
Date: Tue Oct 2 23:44:44 2007 -0400
cpuidle: shrink diff
processor_idle.c | 440 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 429 insertions(+), 11 deletions(-)
Signed-off-by: Len Brown <len.brown@intel.com>
commit dfbb9d5aedfb18848a3e0d6f6e3e4969febb209c
Author: Len Brown <len.brown@intel.com>
Date: Wed Sep 26 02:17:55 2007 -0400
cpuidle: reduce diff size
Reduces the cpuidle processor_idle.c diff vs 2.6.22 from this
processor_idle.c | 2006 ++++++++++++++++++++++++++-----------------
1 file changed, 1219 insertions(+), 787 deletions(-)
to this:
processor_idle.c | 502 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 458 insertions(+), 44 deletions(-)
...for the purpose of making the cpuilde patch less invasive
and easier to review.
no functional changes. build tested only.
Signed-off-by: Len Brown <len.brown@intel.com>
commit 889172fc915f5a7fe20f35b133cbd205ce69bf6c
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Sep 13 13:40:05 2007 -0700
cpuidle: Retain old ACPI policy for !CONFIG_CPU_IDLE
Retain the old policy in processor_idle, so that when CPU_IDLE is not
configured, old C-state policy will still be used. This provides a
clean gradual migration path from old ACPI policy to new cpuidle
based policy.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 9544a8181edc7ecc33b3bfd69271571f98ed08bc
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Sep 13 13:39:17 2007 -0700
cpuidle: Configure governors by default
Quoting Len "Do not give an option to users to shoot themselves in the foot".
Remove the configurability of ladder and menu governors as they are
needed for default policy of cpuidle. That way users will not be able to
have cpuidle without any policy loosing all C-state power savings.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8975059a2c1e56cfe83d1bcf031bcf4cb39be743
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:27:07 2007 -0400
CPUIDLE: load ACPI properly when CPUIDLE is disabled
Change the registration return codes for when CPUIDLE
support is not compiled into the kernel. As a result, the ACPI
processor driver will load properly even if CPUIDLE is unavailable.
However, it may be possible to cleanup the ACPI processor driver further
and eliminate some dead code paths.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit e0322e2b58dd1b12ec669bf84693efe0dc2414a8
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:26:06 2007 -0400
CPUIDLE: remove cpuidle_get_bm_activity()
Remove cpuidle_get_bm_activity() and updates governors
accordingly.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 18a6e770d5c82ba26653e53d240caa617e09e9ab
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:58 2007 -0400
CPUIDLE: max_cstate fix
Currently max_cstate is limited to 0, resulting in no idle processor
power management on ACPI platforms. This patch restores the value to
the array size.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1fdc0887286179b40ce24bcdbde663172e205ef0
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:40 2007 -0400
CPUIDLE: handle BM detection inside the ACPI Processor driver
Update the ACPI processor driver to detect BM activity and
limit state entry depth internally, rather than exposing such
requirements to CPUIDLE. As a result, CPUIDLE can drop this
ACPI-specific interface and become more platform independent. BM
activity is now handled much more aggressively than it was in the
original implementation, so some testing coverage may be needed to
verify that this doesn't introduce any DMA buffer under-run issues.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0ef38840db666f48e3cdd2b769da676c57228dd9
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:14 2007 -0400
CPUIDLE: menu governor updates
Tweak the menu governor to more effectively handle non-timer
break events. Non-timer break events are detected by comparing the
actual sleep time to the expected sleep time. In future revisions, it
may be more reliable to use the timer data structures directly.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit bb4d74fca63fa96cf3ace644b15ae0f12b7df5a1
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:24:40 2007 -0400
CPUIDLE: fix 'current_governor' sysfs entry
Allow the "current_governor" sysfs entry to properly handle
input terminated with '\n'.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit df3c71559bb69b125f1a48971bf0d17f78bbdf47
Author: Len Brown <len.brown@intel.com>
Date: Sun Aug 12 02:00:45 2007 -0400
cpuidle: fix IA64 build (again)
Signed-off-by: Len Brown <len.brown@intel.com>
commit a02064579e3f9530fd31baae16b1fc46b5a7bca8
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:39:27 2007 -0400
cpuidle: Remove support for runtime changing of max_cstate
Remove support for runtime changeability of max_cstate. Drivers can use
use latency APIs.
max_cstate can still be used as a boot time option and dmi override.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0912a44b13adf22f5e3f607d263aed23b4910d7e
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:39:16 2007 -0400
cpuidle: Remove ACPI cstate_limit calls from ipw2100
ipw2100 already has code to use accetable_latency interfaces to limit the
C-state. Remove the calls to acpi_set_cstate_limit and acpi_get_cstate_limit
as they are redundant.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c649a76e76be6bff1fd770d0a775798813a3f6e0
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:35:39 2007 -0400
cpuidle: compile fix for pause and resume functions
Fix the compilation failure when cpuidle is not compiled in.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Adam Belay <adam.belay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 2305a5920fb8ee6ccec1c62ade05aa8351091d71
Author: Adam Belay <abelay@novell.com>
Date: Thu Jul 19 00:49:00 2007 -0400
cpuidle: re-write
Some portions have been rewritten to make the code cleaner and lighter
weight. The following is a list of changes:
1.) the state name is now included in the sysfs interface
2.) detection, hotplug, and available state modifications are handled by
CPUIDLE drivers directly
3.) the CPUIDLE idle handler is only ever installed when at least one
cpuidle_device is enabled and ready
4.) the menu governor BM code no longer overflows
5.) the sysfs attributes are now printed as unsigned integers, avoiding
negative values
6.) a variety of other small cleanups
Also, Idle drivers are no longer swappable during runtime through the
CPUIDLE sysfs inteface. On i386 and x86_64 most idle handlers (e.g.
poll, mwait, halt, etc.) don't benefit from an infrastructure that
supports multiple states, so I think using a more general case idle
handler selection mechanism would be cleaner.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit df25b6b56955714e6e24b574d88d1fd11f0c3ee5
Author: Len Brown <len.brown@intel.com>
Date: Tue Jul 24 17:08:21 2007 -0400
cpuidle: fix IA64 buid
Signed-off-by: Len Brown <len.brown@intel.com>
commit fd6ada4c14488755ff7068860078c437431fbccd
Author: Adrian Bunk <bunk@stusta.de>
Date: Mon Jul 9 11:33:13 2007 -0700
cpuidle: static
make cpuidle_replace_governor() static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c1d4a2cebcadf2429c0c72e1d29aa2a9684c32e0
Author: Adrian Bunk <bunk@stusta.de>
Date: Tue Jul 3 00:54:40 2007 -0400
cpuidle: static
This patch makes the needlessly global struct menu_governor static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit dbf8780c6e8d572c2c273da97ed1cca7608fd999
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Tue Jul 3 00:49:14 2007 -0400
export symbol tick_nohz_get_sleep_length
ERROR: "tick_nohz_get_sleep_length" [drivers/cpuidle/governors/menu.ko] undefined!
ERROR: "tick_nohz_get_idle_jiffies" [drivers/cpuidle/governors/menu.ko] undefined!
And please be sure to get your changes to core kernel suitably reviewed.
Cc: Adam Belay <abelay@novell.com>
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 29f0e248e7017be15f99febf9143a2cef00b2961
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Tue Jul 3 00:43:04 2007 -0400
tick.h needs hrtimer.h
It uses hrtimers.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit e40cede7d63a029e92712a3fe02faee60cc38fb4
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:40:34 2007 -0400
cpuidle: first round of documentation updates
Documentation changes based on Pavel's feedback.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 83b42be2efece386976507555c29e7773a0dfcd1
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:39:25 2007 -0400
cpuidle: add rating to the governors and pick the one with highest rating by default
Introduce a governor rating scheme to pick the right governor by default.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit d2a74b8c5e8f22def4709330d4bfc4a29209b71c
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:38:08 2007 -0400
cpuidle: make cpuidle sysfs driver governor switch off by default
Make default cpuidle sysfs to show current_governor and current_driver in
read-only mode. More elaborate available_governors and available_drivers with
writeable current_governor and current_driver interface only appear with
"cpuidle_sysfs_switch" boot parameter.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1f60a0e80bf83cf6b55c8845bbe5596ed8f6307b
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:37:00 2007 -0400
cpuidle: menu governor: change the early break condition
Change the C-state early break out algorithm in menu governor.
We only look at early breakouts that result in wakeups shorter than idle
state's target_residency. If such a breakout is frequent enough, eliminate
the particular idle state upto a timeout period.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 45a42095cf64b003b4a69be3ce7f434f97d7af51
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:35:38 2007 -0400
cpuidle: fix uninitialized variable in sysfs routine
Fix the uninitialized usage of ret.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 80dca7cdba3e6ee13eae277660873ab9584eb3be
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:34:16 2007 -0400
cpuidle: reenable /proc/acpi//power interface for the time being
Keep /proc/acpi/processor/CPU*/power around for a while as powertop depends
on it. It will be marked deprecated and removed in future. powertop can use
cpuidle interfaces instead.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 589c37c2646c5e3813a51255a5ee1159cb4c33fc
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:32:37 2007 -0400
cpuidle: menu governor and hrtimer compile fix
Compile fix for menu governor.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0ba80bd9ab3ed304cb4f19b722e4cc6740588b5e
Author: Len Brown <len.brown@intel.com>
Date: Thu May 31 22:51:43 2007 -0400
cpuidle: build fix - cpuidle vs ipw2100 module
ERROR: "acpi_set_cstate_limit" [drivers/net/wireless/ipw2100.ko] undefined!
Signed-off-by: Len Brown <len.brown@intel.com>
commit d7d8fa7f96a7f7682be7c6cc0cc53fa7a18c3b58
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:47:07 2007 -0400
cpuidle: add the 'menu' governor
Here is my first take at implementing an idle PM governor that takes
full advantage of NO_HZ. I call it the 'menu' governor because it
considers the full list of idle states before each entry.
I've kept the implementation fairly simple. It attempts to guess the
next residency time and then chooses a state that would meet at least
the break-even point between power savings and entry cost. To this end,
it selects the deepest idle state that satisfies the following
constraints:
1. If the idle time elapsed since bus master activity was detected
is below a threshold (currently 20 ms), then limit the selection
to C2-type or above.
2. Do not choose a state with a break-even residency that exceeds
the expected time remaining until the next timer interrupt.
3. Do not choose a state with a break-even residency that exceeds
the elapsed time between the last pair of break events,
excluding timer interrupts.
This governor has an advantage over "ladder" governor because it
proactively checks how much time remains until the next timer interrupt
using the tick infrastructure. Also, it handles device interrupt
activity more intelligently by not including timer interrupts in break
event calculations. Finally, it doesn't make policy decisions using the
number of state entries, which can have variable residency times (NO_HZ
makes these potentially very large), and instead only considers sleep
time deltas.
The menu governor can be selected during runtime using the cpuidle sysfs
interface like so:
"echo "menu" > /sys/devices/system/cpu/cpuidle/current_governor"
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit a4bec7e65aa3b7488b879d971651cc99a6c410fe
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:47:03 2007 -0400
cpuidle: export time until next timer interrupt using NO_HZ
Expose information about the time remaining until the next
timer interrupt expires by utilizing the dynticks infrastructure.
Also modify the main idle loop to allow dynticks to handle
non-interrupt break events (e.g. DMA). Finally, expose sleep ticks
information to external code. Thomas Gleixner is responsible for much
of the code in this patch. However, I've made some additional changes,
so I'm probably responsible if there are any bugs or oversights :)
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 2929d8996fbc77f41a5ff86bb67cdde3ca7d2d72
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:46:58 2007 -0400
cpuidle: governor API changes
This patch prepares cpuidle for the menu governor. It adds an optional
stage after idle state entry to give the governor an opportunity to
check why the state was exited. Also it makes sure the idle loop
returns after each state entry, allowing the appropriate dynticks code
to run.
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 3a7fd42f9825c3b03e364ca59baa751bb350775f
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Apr 26 00:03:59 2007 -0700
cpuidle: hang fix
Prevent hang on x86-64, when ACPI processor driver is added as a module on
a system that does not support C-states.
x86-64 expects all idle handlers to enable interrupts before returning from
idle handler. This is due to enter_idle(), exit_idle() races. Make
cpuidle_idle_call() confirm to this when there is no pm_idle_old.
Also, cpuidle look at the return values of attch_driver() and set
current_driver to NULL if attach fails on all CPUs.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 4893339a142afbd5b7c01ffadfd53d14746e858e
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:09 2007 +0800
cpuidle: add support for max_cstate limit
With CPUIDLE framework, the max_cstate (to limit max cpu c-state)
parameter is ingored. Some systems require it to ignore C2/C3
and some drivers like ipw require it too.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 43bbbbe1cb998cbd2df656f55bb3bfe30f30e7d1
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:13 2007 +0800
cpuidle: add cpuidle_fore_redetect_devices API
add cpuidle_force_redetect_devices API,
which forces all CPU redetect idle states.
Next patch will use it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit d1edadd608f24836def5ec483d2edccfb37b1d19
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:01 2007 +0800
cpuidle: fix sysfs related issue
Fix the cpuidle sysfs issue.
a. make kobject dynamicaly allocated
b. fixed sysfs init issue to avoid suspend/resume issue
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 7169a5cc0d67b263978859672e86c13c23a5570d
Author: Randy Dunlap <randy.dunlap@oracle.com>
Date: Wed Mar 28 22:52:53 2007 -0400
cpuidle: 1-bit field must be unsigned
A 1-bit bitfield has no room for a sign bit.
drivers/cpuidle/governors/ladder.c:54:16: error: dubious bitfield without explicit `signed' or `unsigned'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 4658620158dc2fbd9e4bcb213c5b6fb5d05ba7d4
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Wed Mar 28 22:52:41 2007 -0400
cpuidle: fix boot hang
Patch for cpuidle boot hang reported by Larry Finger here.
http://www.ussg.iu.edu/hypermail/linux/kernel/0703.2/2025.html
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Larry Finger <larry.finger@lwfinger.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c17e168aa6e5fe3851baaae8df2fbc1cf11443a9
Author: Len Brown <len.brown@intel.com>
Date: Wed Mar 7 04:37:53 2007 -0500
cpuidle: ladder does not depend on ACPI
build fix for CONFIG_ACPI=n
In file included from drivers/cpuidle/governors/ladder.c:21:
include/acpi/processor.h:88: error: expected specifier-qualifier-list before âacpi_integerâ
include/acpi/processor.h:106: error: expected specifier-qualifier-list before âacpi_integerâ
include/acpi/processor.h:168: error: expected specifier-qualifier-list before âacpi_handleâ
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8c91d958246bde68db0c3f0c57b535962ce861cb
Author: Adrian Bunk <bunk@stusta.de>
Date: Tue Mar 6 02:29:40 2007 -0800
cpuidle: make code static
This patch makes the following needlessly global code static:
- driver.c: __cpuidle_find_driver()
- governor.c: __cpuidle_find_governor()
- ladder.c: struct ladder_governor
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0c39dc3187094c72c33ab65a64d2017b21f372d2
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Wed Mar 7 02:38:22 2007 -0500
cpu_idle: fix build break
This patch fixes a build breakage with !CONFIG_HOTPLUG_CPU and
CONFIG_CPU_IDLE.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8112e3b115659b07df340ef170515799c0105f82
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Mar 6 02:29:39 2007 -0800
cpuidle: build fix for !CPU_IDLE
Fix the compile issues when CPU_IDLE is not configured.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1eb4431e9599cd25e0d9872f3c2c8986821839dd
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:54:57 2007 -0800
cpuidle take2: Basic documentation for cpuidle
Documentation for cpuidle infrastructure
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit ef5f15a8b79123a047285ec2e3899108661df779
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:54:03 2007 -0800
cpuidle take2: Hookup ACPI C-states driver with cpuidle
Hookup ACPI C-states onto generic cpuidle infrastructure.
drivers/acpi/procesor_idle.c is now a ACPI C-states driver that registers as
a driver in cpuidle infrastructure and the policy part is removed from
drivers/acpi/processor_idle.c. We use governor in cpuidle instead.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 987196fa82d4db52c407e8c9d5dec884ba602183
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:52:57 2007 -0800
cpuidle take2: Core cpuidle infrastructure
Announcing 'cpuidle', a new CPU power management infrastructure to manage
idle CPUs in a clean and efficient manner.
cpuidle separates out the drivers that can provide support for multiple types
of idle states and policy governors that decide on what idle state to use
at run time.
A cpuidle driver can support multiple idle states based on parameters like
varying power consumption, wakeup latency, etc (ACPI C-states for example).
A cpuidle governor can be usage model specific (laptop, server,
laptop on battery etc).
Main advantage of the infrastructure being, it allows independent development
of drivers and governors and allows for better CPU power management.
A huge thanks to Adam Belay and Shaohua Li who were part of this mini-project
since its beginning and are greatly responsible for this patchset.
This patch:
Core cpuidle infrastructure.
Introduces a new abstraction layer for cpuidle:
* which manages drivers that can support multiple idles states. Drivers
can be generic or particular to specific hardware/platform
* allows pluging in multiple policy governors that can take idle state policy
decision
* The core also has a set of sysfs interfaces with which administrato can know
about supported drivers and governors and switch them at run time.
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>