Commit Graph

2978 Commits

Author SHA1 Message Date
Rafael J. Wysocki
538b0188da cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
Commit 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover
boost frequencies") attempted to address a performance issue involving
acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
extending the frequency tables created by acpi-cpufreq to cover the
entire range of "turbo" (or "boost") frequencies, but that caused
frequencies reported via /proc/cpuinfo and the scaling_cur_freq
attribute in sysfs to change which may confuse users and monitoring
tools.

For this reason, revert the part of commit 3c55e94c0a adding the
extra entry to the frequency table and use the observation that
in principle cpuinfo.max_freq need not be equal to the maximum
frequency listed in the frequency table for the given policy.

Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
drivers to set their own cpuinfo.max_freq above that frequency and
change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
frequency found via CPPC.

This should be sufficient to let all of the cpufreq subsystem know
the real maximum frequency of the CPU without changing frequency
reporting.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Matt McDonald <gardotd426@gmail.com>
Tested-by: Matt McDonald <gardotd426@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Tested-by: Michael Larabel <Michael@phoronix.com>
Cc: 5.11+ <stable@vger.kernel.org> # 5.11+
2021-02-18 18:34:56 +01:00
Rafael J. Wysocki
8a3f1f181d Merge back cpufreq updates for v5.12. 2021-02-10 19:11:06 +01:00
Rafael J. Wysocki
7ac839a0a7 Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull ARM cpufreq changes for v5.12 from Viresh Kumar:

"- Removal of Tango driver as the platform got removed (Arnd Bergmann).

 - Use resource managed APIs for tegra20 (Dmitry Osipenko).

 - Generic cleanups for brcmstb (Christophe JAILLET).

 - Enable boost support for qcom-hw (Shawn Guo)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: remove tango driver
  cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove()
  cpufreq: brcmstb-avs-cpufreq: Free resources in error path
  cpufreq: qcom-hw: enable boost support
  cpufreq: tegra20: Use resource-managed API
2021-02-08 13:54:58 +01:00
Rafael J. Wysocki
d11a1d08a0 cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
If the maximum performance level taken for computing the
arch_max_freq_ratio value used in the x86 scale-invariance code is
higher than the one corresponding to the cpuinfo.max_freq value
coming from the acpi_cpufreq driver, the scale-invariant utilization
falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly
faster, which causes the schedutil governor to select a frequency
below cpuinfo.max_freq.  That frequency corresponds to a frequency
table entry below the maximum performance level necessary to get to
the "boost" range of CPU frequencies which prevents "boost"
frequencies from being used in some workloads.

While this issue is related to scale-invariance, it may be amplified
by commit db865272d9 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from the 5.10 development cycle which
made it extremely easy to default to schedutil even if the preferred
driver is acpi_cpufreq as long as intel_pstate is built too, because
the mere presence of the latter effectively removes the ondemand
governor from the defaults.  Distro kernels are likely to include
both intel_pstate and acpi_cpufreq on x86, so their users who cannot
use intel_pstate or choose to use acpi_cpufreq may easily be
affectecd by this issue.

If CPPC is available, it can be used to address this issue by
extending the frequency tables created by acpi_cpufreq to cover the
entire available frequency range (including "boost" frequencies) for
each CPU, but if CPPC is not there, acpi_cpufreq has no idea what
the maximum "boost" frequency is and the frequency tables created by
it cannot be extended in a meaningful way, so in that case make it
ask the arch scale-invariance code to to use the "nominal" performance
level for CPU utilization scaling in order to avoid the issue at hand.

Fixes: db865272d9 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-02-08 13:45:51 +01:00
Rafael J. Wysocki
3c55e94c0a cpufreq: ACPI: Extend frequency tables to cover boost frequencies
A severe performance regression on AMD EPYC processors when using
the schedutil scaling governor was discovered by Phoronix.com and
attributed to the following commits:

  41ea667227 ("x86, sched: Calculate frequency invariance for AMD
  systems")

  976df7e573 ("x86, sched: Use midpoint of max_boost and max_P for
  frequency invariance on AMD EPYC")

The source of the problem is that the maximum performance level taken
for computing the arch_max_freq_ratio value used in the x86 scale-
invariance code is higher than the one corresponding to the
cpuinfo.max_freq value coming from the acpi_cpufreq driver.

This effectively causes the scale-invariant utilization to fall below
100% even if the CPU runs at cpuinfo.max_freq or slightly faster, so
the schedutil governor selects a frequency below cpuinfo.max_freq
then.  That frequency corresponds to a frequency table entry below
the maximum performance level necessary to get to the "boost" range
of CPU frequencies.

However, if the cpuinfo.max_freq value coming from acpi_cpufreq was
higher, the schedutil governor would select higher frequencies which
in turn would allow acpi_cpufreq to set more adequate performance
levels and to get to the "boost" range of CPU frequencies more often.

This issue affects any systems where acpi_cpufreq is used and the
"boost" (or "turbo") frequencies are enabled, not just AMD EPYC.
Moreover, commit db865272d9 ("cpufreq: Avoid configuring old
governors as default with intel_pstate") from the 5.10 development
cycle made it extremely easy to default to schedutil even if the
preferred driver is acpi_cpufreq as long as intel_pstate is built
too, because the mere presence of the latter effectively removes the
ondemand governor from the defaults.  Distro kernels are likely to
include both intel_pstate and acpi_cpufreq on x86, so their users
who cannot use intel_pstate or choose to use acpi_cpufreq may
easily be affectecd by this issue.

To address this issue, extend the frequency table constructed by
acpi_cpufreq for each CPU to cover the entire range of available
frequencies (including the "boost" ones) if CPPC is available and
indicates that "boost" (or "turbo") frequencies are enabled.  That
causes cpuinfo.max_freq to become the maximum "boost" frequency of
the given CPU (instead of the maximum frequency returned by the ACPI
_PSS object that corresponds to the "nominal" performance level).

Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 976df7e573 ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
Fixes: db865272d9 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Link: https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1
Link: https://lore.kernel.org/linux-pm/20210203135321.12253-2-ggherdovich@suse.cz/
Reported-by: Michael Larabel <Michael@phoronix.com>
Diagnosed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Tested-by: Michael Larabel <Michael@phoronix.com>
2021-02-08 13:45:51 +01:00
Viresh Kumar
2f0531869f cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN
This flag is set by one of the drivers but it isn't used in the code
otherwise. Remove the unused flag and update the driver.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-04 19:25:47 +01:00
Viresh Kumar
5ae4a4b45d cpufreq: Remove CPUFREQ_STICKY flag
During cpufreq driver's registration, if the ->init() callback for all
the CPUs fail then there is not much point in keeping the driver around
as it will only account for more of unnecessary noise, for example
cpufreq core will try to suspend/resume the driver which never got
registered properly.

The removal of such a driver is avoided if the driver carries the
CPUFREQ_STICKY flag. This was added way back [1] in 2004 and perhaps no
one should ever need it now. A lot of drivers do set this flag, probably
because they just copied it from other drivers.

This was added earlier for some platforms [2] because their cpufreq
drivers were getting registered before the CPUs were registered with
subsys framework. And hence they used to fail.

The same isn't true anymore though. The current code flow in the kernel
is:

start_kernel()
-> kernel_init()
   -> kernel_init_freeable()
      -> do_basic_setup()
         -> driver_init()
            -> cpu_dev_init()
               -> subsys_system_register() //For CPUs

         -> do_initcalls()
            -> cpufreq_register_driver()

Clearly, the CPUs will always get registered with subsys framework
before any cpufreq driver can get probed. Remove the flag and update the
relevant drivers.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/include/linux/cpufreq.h?id=7cc9f0d9a1ab04cedc60d64fd8dcf7df224a3b4d # [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/arch/arm/mach-sa1100/cpu-sa1100.c?id=f59d3bbe35f6268d729f51be82af8325d62f20f5 # [2]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-04 19:23:20 +01:00
Nigel Christian
75a8d877d6 cpufreq: intel_pstate: Remove repeated word
In the comment for trace in passive mode there is an
unnecessary "the". Eradicate it.

Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-22 17:04:56 +01:00
Arnd Bergmann
7114ebffd3 cpufreq: remove tango driver
The tango platform is getting removed, so the driver is no
longer needed.

Cc: Marc Gonzalez <marc.w.gonzalez@free.fr>
Cc: Mans Rullgard <mans@mansr.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ Viresh: Update cpufreq-dt-platdev.c as well ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-01-21 09:34:46 +05:30
Christophe JAILLET
3657f729b6 cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove()
If 'cpufreq_unregister_driver()' fails, just WARN and continue, so that
other resources are freed.

Fixes: de322e0859 ("cpufreq: brcmstb-avs-cpufreq: AVS CPUfreq driver for Broadcom STB SoCs")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ Viresh: Updated Subject ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-01-18 12:23:43 +05:30
Christophe JAILLET
05f456286f cpufreq: brcmstb-avs-cpufreq: Free resources in error path
If 'cpufreq_register_driver()' fails, we must release the resources
allocated in 'brcm_avs_prepare_init()' as already done in the remove
function.

To do that, introduce a new function 'brcm_avs_prepare_uninit()' in order
to avoid code duplication. This also makes the code more readable (IMHO).

Fixes: de322e0859 ("cpufreq: brcmstb-avs-cpufreq: AVS CPUfreq driver for Broadcom STB SoCs")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ Viresh: Updated Subject ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-01-18 12:23:28 +05:30
Shawn Guo
266991721c cpufreq: qcom-hw: enable boost support
At least on sdm850, the 2956800 khz is detected as a boost frequency in
function qcom_cpufreq_hw_read_lut().  Let's enable boost support by
calling cpufreq_enable_boost_support(), so that we can get the boost
frequency by switching it on via 'boost' sysfs entry like below.

 $ echo 1 > /sys/devices/system/cpu/cpufreq/boost

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Tested-by: Steev Klimaszewski <steev@kali.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-01-18 12:13:53 +05:30
Dmitry Osipenko
763ec5daae cpufreq: tegra20: Use resource-managed API
Switch cpufreq-tegra20 driver to use resource-managed API.
This removes the need to get opp_table pointer using
dev_pm_opp_get_opp_table() in order to release OPP table that
was requested by dev_pm_opp_set_supported_hw(), making the code
a bit more straightforward.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-01-18 12:02:53 +05:30
Chen Yu
6f67e06008 cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available
Currently, when turbo is disabled (either by BIOS or by the user),
the intel_pstate driver reads the max non-turbo frequency from the
package-wide MSR_PLATFORM_INFO(0xce) register.

However, on asymmetric platforms it is possible in theory that small
and big core with HWP enabled might have different max non-turbo CPU
frequency, because MSR_HWP_CAPABILITIES is per-CPU scope according
to Intel Software Developer Manual.

The turbo max freq is already per-CPU in current code, so make
similar change to the max non-turbo frequency as well.

Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject and changelog edits ]
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: a45ee4d4e1: cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-12 19:44:49 +01:00
Rafael J. Wysocki
597ffbc8d0 cpufreq: intel_pstate: Rename two functions
Rename intel_cpufreq_adjust_hwp() and intel_cpufreq_adjust_perf_ctl()
to intel_cpufreq_hwp_update() and intel_cpufreq_perf_ctl_update(),
respectively, to avoid possible confusion with the ->adjist_perf()
callback function, intel_cpufreq_adjust_perf().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
2021-01-12 19:34:05 +01:00
Rafael J. Wysocki
a45ee4d4e1 cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
All of the callers of intel_pstate_get_hwp_max() access the struct
cpudata object that corresponds to the given CPU already and the
function itself needs to access that object (in order to update
hwp_cap_cached), so modify the code to pass a struct cpudata pointer
to it instead of the CPU number.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
2021-01-12 19:34:04 +01:00
Rafael J. Wysocki
9dd04ec6bc cpufreq: intel_pstate: Always read hwp_cap_cached with READ_ONCE()
Because intel_pstate_get_hwp_max() which updates hwp_cap_cached
may run in parallel with the readers of it, annotate all of the
read accesses to it with READ_ONCE().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
2021-01-12 19:34:04 +01:00
Lukas Bulwahn
c4151604f0 cpufreq: intel_pstate: remove obsolete functions
percent_fp() was used in intel_pstate_pid_reset(), which was removed in
commit 9d0ef7af1f ("cpufreq: intel_pstate: Do not use PID-based P-state
selection") and hence, percent_fp() is unused since then.

percent_ext_fp() was last used in intel_pstate_update_perf_limits(), which
was refactored in commit 1a4fe38add ("cpufreq: intel_pstate: Remove
max/min fractions to limit performance"), and hence, percent_ext_fp() is
unused since then.

make CC=clang W=1 points us those unused functions:

drivers/cpufreq/intel_pstate.c:79:23: warning: unused function 'percent_fp' [-Wunused-function]
static inline int32_t percent_fp(int percent)
                      ^

drivers/cpufreq/intel_pstate.c:94:23: warning: unused function 'percent_ext_fp' [-Wunused-function]
static inline int32_t percent_ext_fp(int percent)
                      ^

Remove those obsolete functions.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-07 18:22:46 +01:00
Colin Ian King
943bdd0cec cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
Currently there is an unlikely case where cpufreq_cpu_get() returns a
NULL policy and this will cause a NULL pointer dereference later on.

Fix this by passing the policy to transition_frequency_fidvid() from
the caller and hence eliminating the need for the cpufreq_cpu_get()
and cpufreq_cpu_put().

Thanks to Viresh Kumar for suggesting the fix.

Addresses-Coverity: ("Dereference null return")
Fixes: b43a7ffbf3 ("cpufreq: Notify all policy->cpus in cpufreq_notify_transition()")
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-07 17:37:33 +01:00
Rafael J. Wysocki
17ffd35809 cpufreq: intel_pstate: Use HWP capabilities in intel_cpufreq_adjust_perf()
If turbo P-states cannot be used, either due to the configuration of
the processor, or because intel_pstate is not allowed to used them,
the maximum available P-state with HWP enabled corresponds to the
HWP_CAP.GUARANTEED value which is not static.  It can be adjusted by
an out-of-band agent or during an Intel Speed Select performance
level change, so long as it remains less than or equal to
HWP_CAP.MAX.

However, if turbo P-states cannot be used, intel_cpufreq_adjust_perf()
always uses pstate.max_pstate (set during the initialization of the
driver only) as the maximum available P-state, so it may miss a change
of the HWP_CAP.GUARANTEED value.

Prevent that from happening by modifyig intel_cpufreq_adjust_perf()
to always read the "guaranteed" and "maximum turbo" performance
levels from the cached HWP_CAP value.

Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
2021-01-07 17:34:32 +01:00
Rafael J. Wysocki
be1283454b cpufreq: intel_pstate: Fix fast-switch fallback path
When sugov_update_single_perf() falls back to the "frequency"
path due to the missing scale-invariance, it will call
cpufreq_driver_fast_switch() via sugov_fast_switch()
and the driver's ->fast_switch() callback will be invoked,
so it must not be NULL.

However, after commit a365ab6b9d ("cpufreq: intel_pstate: Implement
the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to
NULL when it is going to use intel_cpufreq_adjust_perf(), which is a
mistake, because on x86 the scale-invariance may be turned off
dynamically, so modify it to retain the original ->adjust_perf()
callback pointer.

Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-30 18:22:17 +01:00
Rafael J. Wysocki
c3a74f8e25 Merge branch 'pm-cpufreq'
* pm-cpufreq:
  cpufreq: intel_pstate: Use most recent guaranteed performance values
  cpufreq: intel_pstate: Implement the ->adjust_perf() callback
  cpufreq: Add special-purpose fast-switching callback for drivers
  cpufreq: schedutil: Add util to struct sg_cpu
  cppc_cpufreq: replace per-cpu data array with a list
  cppc_cpufreq: expose information on frequency domains
  cppc_cpufreq: clarify support for coordination types
  cppc_cpufreq: use policy->cpu as driver of frequency setting
  ACPI: processor: fix NONE coordination for domain mapping failure
  ACPI: processor: Drop duplicate setting of shared_cpu_map
2020-12-22 17:59:11 +01:00
Rafael J. Wysocki
e40ad84c26 cpufreq: intel_pstate: Use most recent guaranteed performance values
When turbo has been disabled by the BIOS, but HWP_CAP.GUARANTEED is
changed later, user space may want to take advantage of this increased
guaranteed performance.

HWP_CAP.GUARANTEED is not a static value.  It can be adjusted by an
out-of-band agent or during an Intel Speed Select performance level
change.  The HWP_CAP.MAX is still the maximum achievable performance
with turbo disabled by the BIOS, so HWP_CAP.GUARANTEED can still
change as long as it remains less than or equal to HWP_CAP.MAX.

When HWP_CAP.GUARANTEED is changed, the sysfs base_frequency
attribute shows the most recent guaranteed frequency value. This
attribute can be used by user space software to update the scaling
min/max limits of the CPU.

Currently, the ->setpolicy() callback already uses the latest
HWP_CAP values when setting HWP_REQ, but the ->verify() callback will
restrict the user settings to the to old guaranteed performance value
which prevents user space from making use of the extra CPU capacity
theoretically available to it after increasing HWP_CAP.GUARANTEED.

To address this, read HWP_CAP in intel_pstate_verify_cpu_policy()
to obtain the maximum P-state that can be used and use that to
confine the policy max limit instead of using the cached and
possibly stale pstate.max_freq value for this purpose.

For consistency, update intel_pstate_update_perf_limits() to use the
maximum available P-state returned by intel_pstate_get_hwp_max() to
compute the maximum frequency instead of using the return value of
intel_pstate_get_max_freq() which, again, may be stale.

This issue is a side-effect of fixing the scaling frequency limits in
commit eacc9c5a92 ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max()
for turbo disabled") which corrected the setting of the reduced scaling
frequency values, but caused stale HWP_CAP.GUARANTEED to be used in
the case at hand.

Fixes: eacc9c5a92 ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-21 10:51:00 +01:00
Rafael J. Wysocki
a365ab6b9d cpufreq: intel_pstate: Implement the ->adjust_perf() callback
Make intel_pstate expose the ->adjust_perf() callback when it
operates in the passive mode with HWP enabled which causes the
schedutil governor to use that callback instead of ->fast_switch().

The minimum and target performance-level values passed by the
governor to ->adjust_perf() are converted to HWP.REQ.MIN and
HWP.REQ.DESIRED, respectively, which allows the processor to
adjust its configuration to maximize energy-efficiency while
providing sufficient capacity.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-15 19:24:19 +01:00
Rafael J. Wysocki
ee2cc4276b cpufreq: Add special-purpose fast-switching callback for drivers
First off, some cpufreq drivers (eg. intel_pstate) can pass hints
beyond the current target frequency to the hardware and there are no
provisions for doing that in the cpufreq framework.  In particular,
today the driver has to assume that it should not allow the frequency
to fall below the one requested by the governor (or the required
capacity may not be provided) which may not be the case and which may
lead to excessive energy usage in some scenarios.

Second, the hints passed by these drivers to the hardware need not be
in terms of the frequency, so representing the utilization numbers
coming from the scheduler as frequency before passing them to those
drivers is not really useful.

Address the two points above by adding a special-purpose replacement
for the ->fast_switch callback, called ->adjust_perf, allowing the
governor to pass abstract performance level (rather than frequency)
values for the minimum (required) and target (desired) performance
along with the CPU capacity to compare them to.

Also update the schedutil governor to use the new callback instead
of ->fast_switch if present and if the utilization mertics are
frequency-invariant (that is requisite for the direct mapping
between the utilization and the CPU performance levels to be a
reasonable approximation).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-15 19:24:18 +01:00
Ionela Voinescu
a28b2bfc09 cppc_cpufreq: replace per-cpu data array with a list
The cppc_cpudata per-cpu storage was inefficient (1) additional to causing
functional issues (2) when CPUs are hotplugged out, due to per-cpu data
being improperly initialised.

(1) The amount of information needed for CPPC performance control in its
    cpufreq driver depends on the domain (PSD) coordination type:

    ANY:    One set of CPPC control and capability data (e.g desired
            performance, highest/lowest performance, etc) applies to all
            CPUs in the domain.

    ALL:    Same as ANY. To be noted that this type is not currently
            supported. When supported, information about which CPUs
            belong to a domain is needed in order for frequency change
            requests to be sent to each of them.

    HW:     It's necessary to store CPPC control and capability
            information for all the CPUs. HW will then coordinate the
            performance state based on their limitations and requests.

    NONE:   Same as HW. No HW coordination is expected.

    Despite this, the previous initialisation code would indiscriminately
    allocate memory for all CPUs (all_cpu_data) and unnecessarily
    duplicate performance capabilities and the domain sharing mask and type
    for each possible CPU.

(2) With the current per-cpu structure, when having ANY coordination,
    the cppc_cpudata cpu information is not initialised (will remain 0)
    for all CPUs in a policy, other than policy->cpu. When policy->cpu is
    hotplugged out, the driver will incorrectly use the uninitialised (0)
    value of the other CPUs when making frequency changes. Additionally,
    the previous values stored in the perf_ctrls.desired_perf will be
    lost when policy->cpu changes.

Therefore replace the array of per cpu data with a list. The memory for
each structure is allocated at policy init, where a single structure
can be allocated per policy, not per cpu. In order to accommodate the
struct list_head node in the cppc_cpudata structure, the now unused cpu
and cur_policy variables are removed.

For example, on a arm64 Juno platform with 6 CPUs: (0, 1, 2, 3) in PSD1,
(4, 5) in PSD2 - ANY coordination, the memory allocation comparison shows:

Before patch:

 - ANY coordination:
   total    slack      req alloc/free  caller
       0        0        0     0/1     _kernel_size_le_hi32+0x0xffff800008ff7810
       0        0        0     0/6     _kernel_size_le_hi32+0x0xffff800008ff7808
     128       80       48     1/0     _kernel_size_le_hi32+0x0xffff800008ffc070
     768        0      768     6/0     _kernel_size_le_hi32+0x0xffff800008ffc0e4

After patch:

 - ANY coordination:
    total    slack      req alloc/free  caller
     256        0      256     2/0     _kernel_size_le_hi32+0x0xffff800008fed410
       0        0        0     0/2     _kernel_size_le_hi32+0x0xffff800008fed274

Additional notes:
 - A pointer to the policy's cppc_cpudata is stored in policy->driver_data
 - Driver registration is skipped if _CPC entries are not present.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Mian Yousaf Kaukab <ykaukab@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-15 19:19:32 +01:00
Ionela Voinescu
cfdc589f4b cppc_cpufreq: expose information on frequency domains
Use the existing sysfs attribute "freqdomain_cpus" to expose
information to userspace about CPUs in the same frequency domain.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Mian Yousaf Kaukab <ykaukab@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-15 19:19:32 +01:00
Ionela Voinescu
bf76bb208f cppc_cpufreq: clarify support for coordination types
The previous coordination type handling in the cppc_cpufreq init code
created some confusion: the comment mentioned "Support only SW_ANY for
now" while only the SW_ALL/ALL case resulted in a failure. The other
coordination types (HW_ALL/HW, NONE) were silently supported.

Clarify support for coordination types while describing in comments the
intended behavior.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Mian Yousaf Kaukab <ykaukab@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-15 19:19:32 +01:00
Ionela Voinescu
d2641a5c3d cppc_cpufreq: use policy->cpu as driver of frequency setting
Considering only the currently supported coordination types (ANY, HW,
NONE), this change only makes a difference for the ANY type, when
policy->cpu is hotplugged out. In that case the new policy->cpu will
be different from ((struct cppc_cpudata *)policy->driver_data)->cpu.

While in this case the controls of *ANY* CPU could be used to drive
frequency changes, it's more consistent to use policy->cpu as the
leading CPU, as used in all other cppc_cpufreq functions. Additionally,
the debug prints in cppc_set_perf() would no longer create confusion
when referring to a CPU that is hotplugged out.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Mian Yousaf Kaukab <ykaukab@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-15 19:19:31 +01:00
Rafael J. Wysocki
e1f1320fc0 Merge branch 'pm-cpufreq'
* pm-cpufreq: (31 commits)
  cpufreq: Fix cpufreq_online() return value on errors
  cpufreq: Fix up several kerneldoc comments
  cpufreq: stats: Use local_clock() instead of jiffies
  cpufreq: schedutil: Simplify sugov_update_next_freq()
  cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate()
  cpufreq: arm_scmi: Discover the power scale in performance protocol
  firmware: arm_scmi: Add power_scale_mw_get() interface
  cpufreq: tegra194: Rename tegra194_get_speed_common function
  cpufreq: tegra194: Remove unnecessary frequency calculation
  cpufreq: tegra186: Simplify cluster information lookup
  cpufreq: tegra186: Fix sparse 'incorrect type in assignment' warning
  cpufreq: imx: fix NVMEM_IMX_OCOTP dependency
  cpufreq: vexpress-spc: Add missing MODULE_ALIAS
  cpufreq: scpi: Add missing MODULE_ALIAS
  cpufreq: loongson1: Add missing MODULE_ALIAS
  cpufreq: sun50i: Add missing MODULE_DEVICE_TABLE
  cpufreq: st: Add missing MODULE_DEVICE_TABLE
  cpufreq: qcom: Add missing MODULE_DEVICE_TABLE
  cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
  cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
  ...
2020-12-15 15:24:52 +01:00
Rafael J. Wysocki
30c768829a Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull ARM cpufreq updates for 5.11-rc1 from Viresh Kumar:

"This contains the following updates:

 - Fix imx's NVMEM_IMX_OCOTP dependency (Arnd Bergmann).

 - Add support for mt8167 and blacklist mt8516 (Fabien Parent).

 - Some ->get() callback related cleanups to the tegra194 driver and
   some optimizations in tegra186 driver (Jon Hunter and Sumit Gupta).

 - Power scale improvements to arm_scmi driver (Lukasz Luba).

 - Add missing MODULE_DEVICE_TABLE and MODULE_ALIAS to several drivers
   (Pali Rohár).

 - Fix error path in mediatek driver (Qinglang Miao).

 - Fix memleak in ST's cpufreq driver (Yangtao Li)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: (22 commits)
  cpufreq: arm_scmi: Discover the power scale in performance protocol
  firmware: arm_scmi: Add power_scale_mw_get() interface
  cpufreq: tegra194: Rename tegra194_get_speed_common function
  cpufreq: tegra194: Remove unnecessary frequency calculation
  cpufreq: tegra186: Simplify cluster information lookup
  cpufreq: tegra186: Fix sparse 'incorrect type in assignment' warning
  cpufreq: imx: fix NVMEM_IMX_OCOTP dependency
  cpufreq: vexpress-spc: Add missing MODULE_ALIAS
  cpufreq: scpi: Add missing MODULE_ALIAS
  cpufreq: loongson1: Add missing MODULE_ALIAS
  cpufreq: sun50i: Add missing MODULE_DEVICE_TABLE
  cpufreq: st: Add missing MODULE_DEVICE_TABLE
  cpufreq: qcom: Add missing MODULE_DEVICE_TABLE
  cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
  cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
  cpufreq: ap806: Add missing MODULE_DEVICE_TABLE
  cpufreq: mediatek: add missing platform_driver_unregister() on error in mtk_cpufreq_driver_init
  cpufreq: tegra194: get consistent cpuinfo_cur_freq
  cpufreq: blacklist mt8516 in cpufreq-dt-platdev
  cpufreq: mediatek: Add support for mt8167
  ...
2020-12-14 20:29:50 +01:00
Rafael J. Wysocki
f0f6dbaf06 Merge branch 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull OPP (Operating Performance Points) updates for 5.11-rc1 from
Viresh Kumar:

"This contains the following updates:

 - Allow empty (node-less) OPP tables in DT for passing just the
   dependency related information (Nicola Mazzucato).

 - Fix a potential lockdep in OPP core and other OPP core cleanups
   (Viresh Kumar).

 - Don't abuse dev_pm_opp_get_opp_table() to create an OPP table, fix
   cpufreq-dt driver for the same (Viresh Kumar).

 - dev_pm_opp_put_regulators() accepts a NULL argument now, updates to
   all the users as well (Viresh Kumar)."

* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  opp: of: Allow empty opp-table with opp-shared
  dt-bindings: opp: Allow empty OPP tables
  media: venus: dev_pm_opp_put_*() accepts NULL argument
  drm/panfrost: dev_pm_opp_put_*() accepts NULL argument
  drm/lima: dev_pm_opp_put_*() accepts NULL argument
  PM / devfreq: exynos: dev_pm_opp_put_*() accepts NULL argument
  cpufreq: qcom-cpufreq-nvmem: dev_pm_opp_put_*() accepts NULL argument
  cpufreq: dt: dev_pm_opp_put_regulators() accepts NULL argument
  opp: Allow dev_pm_opp_put_*() APIs to accept NULL opp_table
  opp: Don't create an OPP table from dev_pm_opp_get_opp_table()
  cpufreq: dt: Don't (ab)use dev_pm_opp_get_opp_table() to create OPP table
  opp: Reduce the size of critical section in _opp_kref_release()
  opp: Don't return opp_dev from _find_opp_dev()
  opp: Allocate the OPP table outside of opp_table_lock
  opp: Always add entries in dev_list with opp_table->lock held
2020-12-14 20:26:17 +01:00
Wang ShaoBo
b96f038432 cpufreq: Fix cpufreq_online() return value on errors
Make cpufreq_online() return negative error codes on all errors that
cause the policy to be destroyed, as appropriate.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-11 19:54:17 +01:00
Rafael J. Wysocki
ec06e586ab cpufreq: Fix up several kerneldoc comments
Fix up the remaining kerneldoc comments that don't adhere to the
expected format and clarify some of them a bit.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-11 19:54:08 +01:00
Viresh Kumar
7854c7520b cpufreq: stats: Use local_clock() instead of jiffies
local_clock() has better precision and accuracy as compared to jiffies,
lets use it for time management in cpufreq stats.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-11 19:53:58 +01:00
Rafael J. Wysocki
2554c32f0b cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate()
Avoid doing the same assignment in both branches of a conditional,
do it after the whole conditional instead.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-11 19:53:58 +01:00
Rafael J. Wysocki
42807537b6 Merge back cpufreq material for v5.11. 2020-12-11 19:52:52 +01:00
Viresh Kumar
2ff8fe13ac cpufreq: qcom-cpufreq-nvmem: dev_pm_opp_put_*() accepts NULL argument
The dev_pm_opp_put_*() APIs now accepts a NULL opp_table pointer and so
there is no need for us to carry the extra checks. Drop them.

Reviewed-by: Ilia Lin <ilia.lin@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-09 11:21:12 +05:30
Viresh Kumar
5f6ffb8d8f cpufreq: dt: dev_pm_opp_put_regulators() accepts NULL argument
The dev_pm_opp_put_*() APIs now accepts a NULL opp_table pointer and so
there is no need for us to carry the extra checks. Drop them.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-09 11:21:11 +05:30
Viresh Kumar
873c9851eb cpufreq: dt: Don't (ab)use dev_pm_opp_get_opp_table() to create OPP table
Initially, the helper dev_pm_opp_get_opp_table() was supposed to be used
only for the OPP core's internal use (it tries to find an existing OPP
table and if it doesn't find one, then it allocates the OPP table).

Sometime back, the cpufreq-dt driver started using it to make sure all
the relevant resources required by the OPP core are available earlier
during initialization process to properly propagate -EPROBE_DEFER.

It worked but it also abused the API to create an OPP table, which
should be created with the help of other helpers provided by the OPP
core.

The OPP core will be updated in a later commit to limit the scope of
dev_pm_opp_get_opp_table() to only finding an existing OPP table and not
create one. This commit updates the cpufreq-dt driver before that
happens.

Now the cpufreq-dt driver creates the OPP and cpufreq tables for all the
CPUs from driver's init callback itself.

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-09 11:21:11 +05:30
Viresh Kumar
c8bb452054 Merge branch 'cpufreq/scmi' into cpufreq/arm/linux-next 2020-12-08 11:22:17 +05:30
Lukasz Luba
f9b0498d29 cpufreq: arm_scmi: Discover the power scale in performance protocol
Add mechanism to discover the power scale present in the performance
protocol for all domains. Provide this information to Energy Model,
which then can be checked in other frameworks, e.g. thermal.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-08 10:16:13 +05:30
Jon Hunter
f45f89a778 cpufreq: tegra194: Rename tegra194_get_speed_common function
The function tegra194_get_speed_common() uses hardware timers to
calculate the current CPUFREQ and so rename this function to be
tegra194_calculate_speed() to reflect what it does.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:44 +05:30
Jon Hunter
93549516d4 cpufreq: tegra194: Remove unnecessary frequency calculation
The Tegra194 CPUFREQ driver sets the CPUFREQ_NEED_INITIAL_FREQ_CHECK
flag which means that the CPUFREQ framework will call the 'get' callback
on boot to determine the current frequency of the CPUs. Therefore, it is
not necessary for the Tegra194 CPUFREQ driver to internally call the
tegra194_get_speed_common() during initialisation to query the current
frequency as well. Fix this by removing the call to the
tegra194_get_speed_common() during initialisation and simplify the code.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:44 +05:30
Jon Hunter
cfef4bcacc cpufreq: tegra186: Simplify cluster information lookup
The CPUFREQ driver framework references each individual CPUs when
getting and setting the speed. Tegra186 has 3 clusters of A57 CPUs and
1 cluster of Denver CPUs. Hence, the Tegra186 CPUFREQ driver need to
know which cluster a given CPU belongs to. The logic in the Tegra186
driver can be greatly simplified by storing the cluster ID associated
with each CPU in the tegra186_cpufreq_cpu structure. This allow us to
completely remove the Tegra cluster info structure from the driver and
simplifiy the code.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:44 +05:30
Jon Hunter
b7b4e78552 cpufreq: tegra186: Fix sparse 'incorrect type in assignment' warning
Sparse warns that the incorrect type is being assigned to the CPUFREQ
driver_data variable in the Tegra186 CPUFREQ driver. The Tegra186
CPUFREQ driver is assigned a type of 'void __iomem *' to a pointer of
type 'void *' ...

 drivers/cpufreq/tegra186-cpufreq.c:72:37: sparse: sparse: incorrect
 	type in assignment (different address spaces) @@
	expected void *driver_data @@     got void [noderef] __iomem * @@
 ...

 drivers/cpufreq/tegra186-cpufreq.c:87:40: sparse: sparse: incorrect
 	type in initializer (different address spaces) @@
	expected void [noderef] __iomem *edvd_reg @@     got void *driver_data @@

The Tegra186 CPUFREQ driver is using the policy->driver_data variable to
store and iomem pointer to a Tegra186 CPU register that is used to set
the clock speed for the CPU. This is not necessary because the register
base address is already stored in the driver data and the offset of the
register for each CPU is static. Therefore, fix this by adding a new
structure with the register offsets for each CPU and store this in the
main driver data structure along with the register base address. Please
note that a new structure has been added for storing the register
offsets rather than a simple array, because this will permit further
clean-ups and simplification of the driver.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:44 +05:30
Arnd Bergmann
fc928b901d cpufreq: imx: fix NVMEM_IMX_OCOTP dependency
A driver should not 'select' drivers from another subsystem.
If NVMEM is disabled, this one results in a warning:

WARNING: unmet direct dependencies detected for NVMEM_IMX_OCOTP
  Depends on [n]: NVMEM [=n] && (ARCH_MXC [=y] || COMPILE_TEST [=y]) && HAS_IOMEM [=y]
  Selected by [y]:
  - ARM_IMX6Q_CPUFREQ [=y] && CPU_FREQ [=y] && (ARM || ARM64 [=y]) && ARCH_MXC [=y] && REGULATOR_ANATOP [=y]

Change the 'select' to 'depends on' to prevent it from going wrong,
and allow compile-testing without that driver, since it is only
a runtime dependency.

Fixes: 2782ef34ed ("cpufreq: imx: Select NVMEM_IMX_OCOTP")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:38 +05:30
Pali Rohár
d15183991c cpufreq: vexpress-spc: Add missing MODULE_ALIAS
This patch adds missing MODULE_ALIAS for automatic loading of this cpufreq
driver when it is compiled as an external module.

Signed-off-by: Pali Rohár <pali@kernel.org>
Fixes: 47ac9aa165 ("cpufreq: arm_big_little: add vexpress SPC interface driver")
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:38 +05:30
Pali Rohár
c0382d049d cpufreq: scpi: Add missing MODULE_ALIAS
This patch adds missing MODULE_ALIAS for automatic loading of this cpufreq
driver when it is compiled as an external module.

Signed-off-by: Pali Rohár <pali@kernel.org>
Fixes: 8def31034d ("cpufreq: arm_big_little: add SCPI interface driver")
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:37 +05:30
Pali Rohár
b9acab0918 cpufreq: loongson1: Add missing MODULE_ALIAS
This patch adds missing MODULE_ALIAS for automatic loading of this cpufreq
driver when it is compiled as an external module.

Signed-off-by: Pali Rohár <pali@kernel.org>
Fixes: a0a22cf144 ("cpufreq: Loongson1: Add cpufreq driver for Loongson1B")
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-07 13:02:37 +05:30