linux/drivers/cpufreq/intel_pstate.c


// SPDX-License-Identifier: GPL-2.0-only
/*
* intel_pstate.c: Native P state management for Intel processors
*
* (C) Copyright 2012 Intel Corporation
* Author: Dirk Brandewie <dirk.j.brandewie@intel.com>
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/kernel.h>
#include <linux/kernel_stat.h>
#include <linux/module.h>
#include <linux/ktime.h>
#include <linux/hrtimer.h>
#include <linux/tick.h>
#include <linux/slab.h>
#include <linux/sched/cpufreq.h>
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/sysfs.h>
#include <linux/types.h>
#include <linux/fs.h>
#include <linux/acpi.h>
#include <linux/vmalloc.h>
#include <linux/pm_qos.h>
#include <trace/events/power.h>
#include <asm/cpu.h>
#include <asm/div64.h>
#include <asm/msr.h>
#include <asm/cpu_device_id.h>
#include <asm/cpufeature.h>
#include <asm/intel-family.h>
#define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC)
#define INTEL_CPUFREQ_TRANSITION_LATENCY 20000
#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000
#define INTEL_CPUFREQ_TRANSITION_DELAY 500
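/*
 * Note on units (following the cpufreq core's conventions): the sampling
 * interval and the transition latency above are in nanoseconds, while the
 * transition delays are in microseconds.
 */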
#ifdef CONFIG_ACPI
#include <acpi/processor.h>
#include <acpi/cppc_acpi.h>
#endif
#define FRAC_BITS 8
#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
#define fp_toint(X) ((X) >> FRAC_BITS)
#define ONE_EIGHTH_FP ((int64_t)1 << (FRAC_BITS - 3))
#define EXT_BITS 6
#define EXT_FRAC_BITS (EXT_BITS + FRAC_BITS)
#define fp_ext_toint(X) ((X) >> EXT_FRAC_BITS)
#define int_ext_tofp(X) ((int64_t)(X) << EXT_FRAC_BITS)
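/*
 * Worked example of the fixed-point format used below (illustrative):
 * with FRAC_BITS == 8, a value carries 8 fractional bits, so
 * int_tofp(3) == 0x300, fp_toint(0x380) == 3 and ONE_EIGHTH_FP == 0x20
 * (0.125). Multiplying two such values needs one right shift to drop
 * the doubled fractional bits:
 * mul_fp(int_tofp(3), int_tofp(2)) == (0x300 * 0x200) >> 8 == int_tofp(6).
 */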
static inline int32_t mul_fp(int32_t x, int32_t y)
{
return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
}
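/*
 * div_fp() deliberately takes s64 operands and uses div64_s64(): the
 * interval between samples can grow to many seconds when timer
 * interrupts are delayed, so a 32-bit shifted dividend could overflow
 * (and a 32-bit fixed-point conversion of the interval could truncate
 * to zero, turning a later division by it into a divide-by-zero).
 */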
static inline int32_t div_fp(s64 x, s64 y)
{
return div64_s64((int64_t)x << FRAC_BITS, y);
}
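/*
 * Round a fixed-point value up to the nearest integer, e.g.
 * ceiling_fp(int_tofp(2)) == 2 but ceiling_fp(int_tofp(2) + 1) == 3.
 */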
static inline int ceiling_fp(int32_t x)
{
int mask, ret;
ret = fp_toint(x);
mask = (1 << FRAC_BITS) - 1;
if (x & mask)
ret += 1;
return ret;
}
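/*
 * The "ext" variants below use EXT_FRAC_BITS == 14 fractional bits for
 * extra precision, e.g. div_ext_fp(3, 4) == (3 << 14) / 4 == 12288,
 * which represents 0.75.
 */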
static inline u64 mul_ext_fp(u64 x, u64 y)
{
return (x * y) >> EXT_FRAC_BITS;
}
static inline u64 div_ext_fp(u64 x, u64 y)
{
return div64_u64(x << EXT_FRAC_BITS, y);
}
/**
* struct sample - Store performance sample
* @core_avg_perf: Ratio of APERF/MPERF which is the actual average
* performance during last sample period
* @busy_scaled: Scaled busy value which is used to calculate the next
* P state. This can differ from @core_avg_perf to
* account for CPU idle periods
* @aperf: Difference of actual performance frequency clock count
* read from APERF MSR between last and current sample
* @mperf: Difference of maximum performance frequency clock count
* read from MPERF MSR between last and current sample
* @tsc: Difference of time stamp counter between last and
* current sample
* @time: Current time from scheduler
*
* This structure is used in the cpudata structure to store performance sample
* data for choosing the next P state.
*/
struct sample {
int32_t core_avg_perf;
int32_t busy_scaled;
u64 aperf;
u64 mperf;
u64 tsc;
u64 time;
};
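/*
 * Illustrative sketch (not driver code): the average performance ratio
 * over a sample period follows from the counter deltas as
 *
 *	core_avg_perf = div_ext_fp(aperf, mperf);
 *
 * so an APERF delta of 3000000 against an MPERF delta of 4000000 yields
 * 0.75 in extended fixed point, i.e. the CPU averaged 75% of its
 * maximum non-turbo frequency over that period.
 */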
/**
* struct pstate_data - Store P state data
* @current_pstate: Current requested P state
* @min_pstate: Min P state possible for this platform
* @max_pstate: Max P state possible for this platform
* @max_pstate_physical: Physical max P state for the processor;
* this can be higher than @max_pstate, which may
* be limited by platform thermal design power limits
* @perf_ctl_scaling: PERF_CTL P-state to frequency scaling factor
* @scaling: Scaling factor between performance and frequency
* @turbo_pstate: Max Turbo P state possible for this platform
* @min_freq: @min_pstate frequency in cpufreq units
* @max_freq: @max_pstate frequency in cpufreq units
* @turbo_freq: @turbo_pstate frequency in cpufreq units
*
* Stores the per-CPU model P state limits and the current P state.
*/
struct pstate_data {
int current_pstate;
int min_pstate;
int max_pstate;
int max_pstate_physical;
int perf_ctl_scaling;
int scaling;
int turbo_pstate;
unsigned int min_freq;
unsigned int max_freq;
unsigned int turbo_freq;
};
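/*
 * Example of the P-state/frequency relationship captured above
 * (values illustrative): on non-hybrid "core" platforms @scaling is
 * 100000, so a @turbo_pstate of 40 corresponds to a @turbo_freq of
 * 40 * 100000 == 4000000 kHz (4 GHz).
 */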
/**
* struct vid_data - Stores voltage information data
* @min: VID data for this platform corresponding to
* the lowest P state
* @max: VID data corresponding to the highest P State.
* @turbo: VID data for turbo P state
* @ratio: Ratio of (vid max - vid min) /
* (max P state - min P state)
*
* Stores the voltage data for DVFS (Dynamic Voltage and Frequency Scaling).
* This data is used on Atom platforms, where in addition to the target P state,
* the voltage data needs to be specified to select the next P state.
*/
struct vid_data {
int min;
int max;
int turbo;
int32_t ratio;
};
/**
* struct global_params - Global parameters, mostly tunable via sysfs.
* @no_turbo: Whether or not to use turbo P-states.
* @turbo_disabled: Whether or not turbo P-states are available at all,
* based on the MSR_IA32_MISC_ENABLE value and whether or
* not the maximum reported turbo P-state is different from
* the maximum reported non-turbo one.
* @turbo_disabled_mf: The @turbo_disabled value reflected by cpuinfo.max_freq.
* @min_perf_pct: Minimum capacity limit in percent of the maximum turbo
* P-state capacity.
* @max_perf_pct: Maximum capacity limit in percent of the maximum turbo
* P-state capacity.
*/
struct global_params {
bool no_turbo;
bool turbo_disabled;
bool turbo_disabled_mf;
int max_perf_pct;
int min_perf_pct;
};
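/*
 * Illustrative example of the global limits: with a @turbo_pstate of 40,
 * setting @max_perf_pct to 50 caps every CPU at P-state 20, since the
 * percentages are relative to the maximum turbo P-state capacity.
 */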
/**
* struct cpudata - Per CPU instance data storage
* @cpu: CPU number for this instance data
* @policy: CPUFreq policy value
* @update_util: CPUFreq utility callback information
* @update_util_set: CPUFreq utility callback is set
* @iowait_boost: iowait-related boost fraction
* @last_update: Time of the last update.
* @pstate: Stores P state limits for this CPU
* @vid: Stores VID limits for this CPU
* @last_sample_time: Last Sample time
* @aperf_mperf_shift: APERF vs MPERF counting frequency difference
* @prev_aperf: Last APERF value read from APERF MSR
* @prev_mperf: Last MPERF value read from MPERF MSR
* @prev_tsc: Last timestamp counter (TSC) value
* @prev_cummulative_iowait: IO wait time difference between the last and
* current sample
* @sample: Storage for the last sample data
* @min_perf_ratio: Minimum capacity in terms of PERF or HWP ratios
* @max_perf_ratio: Maximum capacity in terms of PERF or HWP ratios
* @acpi_perf_data: Stores ACPI perf information read from _PSS
* @valid_pss_table: Set to true for valid ACPI _PSS entries found
* @epp_powersave: Last saved HWP energy performance preference
* (EPP) or energy performance bias (EPB),
* when policy switched to performance
* @epp_policy: Last saved policy used to set EPP/EPB
* @epp_default: Power on default HWP energy performance
* preference/bias
 * @epp_cached:	Cached HWP energy-performance preference value
* @hwp_req_cached: Cached value of the last HWP Request MSR
* @hwp_cap_cached: Cached value of the last HWP Capabilities MSR
* @last_io_update: Last time when IO wake flag was set
* @sched_flags: Store scheduler flags for possible cross CPU update
* @hwp_boost_min: Last HWP boosted min performance
* @suspended: Whether or not the driver has been suspended.
*
 * This structure stores per-CPU instance data for all CPUs.
*/
struct cpudata {
int cpu;
unsigned int policy;
struct update_util_data update_util;
bool update_util_set;
struct pstate_data pstate;
struct vid_data vid;
u64 last_update;
u64 last_sample_time;
u64 aperf_mperf_shift;
u64 prev_aperf;
u64 prev_mperf;
u64 prev_tsc;
u64 prev_cummulative_iowait;
struct sample sample;
int32_t min_perf_ratio;
int32_t max_perf_ratio;
#ifdef CONFIG_ACPI
struct acpi_processor_performance acpi_perf_data;
bool valid_pss_table;
#endif
unsigned int iowait_boost;
s16 epp_powersave;
s16 epp_policy;
s16 epp_default;
s16 epp_cached;
u64 hwp_req_cached;
u64 hwp_cap_cached;
u64 last_io_update;
unsigned int sched_flags;
u32 hwp_boost_min;
bool suspended;
};
static struct cpudata **all_cpu_data;
/**
* struct pstate_funcs - Per CPU model specific callbacks
 * @get_max:		Callback to get maximum non-turbo effective P-state
 * @get_max_physical:	Callback to get maximum non-turbo physical P-state
 * @get_min:		Callback to get minimum P-state
 * @get_turbo:		Callback to get turbo P-state
 * @get_scaling:	Callback to get frequency scaling factor
 * @get_cpu_scaling:	Callback to get frequency scaling factor for a given cpu
 * @get_aperf_mperf_shift: Callback to get the APERF vs MPERF frequency difference
 * @get_val:		Callback to convert P-state to actual MSR write value
 * @get_vid:		Callback to get VID data for Atom platforms
 *
 * Core and Atom CPU models have different ways to get P-state limits. This
 * structure is used to store those callbacks.
*/
struct pstate_funcs {
int (*get_max)(int cpu);
int (*get_max_physical)(int cpu);
int (*get_min)(int cpu);
int (*get_turbo)(int cpu);
int (*get_scaling)(void);
int (*get_cpu_scaling)(int cpu);
int (*get_aperf_mperf_shift)(void);
u64 (*get_val)(struct cpudata*, int pstate);
void (*get_vid)(struct cpudata *);
};
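
/*
 * Illustrative sketch of how this structure is typically populated from a
 * per-CPU-model table at driver init time; the core_* helper names below
 * are assumed here for illustration only:
 *
 *	static const struct pstate_funcs core_funcs = {
 *		.get_max = core_get_max_pstate,
 *		.get_max_physical = core_get_max_pstate_physical,
 *		.get_min = core_get_min_pstate,
 *		.get_turbo = core_get_turbo_pstate,
 *		.get_scaling = core_get_scaling,
 *		.get_val = core_get_val,
 *	};
 */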
static struct pstate_funcs pstate_funcs __read_mostly;
static int hwp_active __read_mostly;
static int hwp_mode_bdw __read_mostly;
static bool per_cpu_limits __read_mostly;
static bool hwp_boost __read_mostly;
static struct cpufreq_driver *intel_pstate_driver __read_mostly;
#ifdef CONFIG_ACPI
static bool acpi_ppc;
#endif
static struct global_params global;
static DEFINE_MUTEX(intel_pstate_driver_lock);
static DEFINE_MUTEX(intel_pstate_limits_lock);
#ifdef CONFIG_ACPI
static bool intel_pstate_acpi_pm_profile_server(void)
{
	return acpi_gbl_FADT.preferred_profile == PM_ENTERPRISE_SERVER ||
	       acpi_gbl_FADT.preferred_profile == PM_PERFORMANCE_SERVER;
}
static bool intel_pstate_get_ppc_enable_status(void)
{
	return intel_pstate_acpi_pm_profile_server() || acpi_ppc;
}
#ifdef CONFIG_ACPI_CPPC_LIB
/* The work item is needed to avoid CPU hotplug locking issues */
static void intel_pstate_sched_itmt_work_fn(struct work_struct *work)
{
sched_set_itmt_support();
}
static DECLARE_WORK(sched_itmt_work, intel_pstate_sched_itmt_work_fn);
#define CPPC_MAX_PERF U8_MAX
static void intel_pstate_set_itmt_prio(int cpu)
{
struct cppc_perf_caps cppc_perf;
static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
int ret;
ret = cppc_get_perf_caps(cpu, &cppc_perf);
if (ret)
return;
/*
	 * On some systems with overclocking enabled, CPPC.highest_perf is
	 * hardcoded to 0xff and cannot be used to enable ITMT. In that case,
	 * fall back to MSR_HWP_CAPABILITIES bits [8:0] (the highest
	 * performance level) to decide.
*/
if (cppc_perf.highest_perf == CPPC_MAX_PERF)
cppc_perf.highest_perf = HWP_HIGHEST_PERF(READ_ONCE(all_cpu_data[cpu]->hwp_cap_cached));
/*
* The priorities can be set regardless of whether or not
* sched_set_itmt_support(true) has been called and it is valid to
* update them at any time after it has been called.
*/
sched_set_itmt_core_prio(cppc_perf.highest_perf, cpu);
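
	/*
	 * ITMT scheduling is only useful when CPUs differ in their highest
	 * performance level. The running max and min of highest_perf are
	 * tracked below; the first time they diverge, the work item that
	 * enables ITMT support is queued. The max_highest_perf <=
	 * min_highest_perf guard makes this a one-shot: once asymmetry has
	 * been detected, the tracking stops.
	 */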
if (max_highest_perf <= min_highest_perf) {
if (cppc_perf.highest_perf > max_highest_perf)
max_highest_perf = cppc_perf.highest_perf;
if (cppc_perf.highest_perf < min_highest_perf)
min_highest_perf = cppc_perf.highest_perf;
if (max_highest_perf > min_highest_perf) {
/*
* This code can be run during CPU online under the
* CPU hotplug locks, so sched_set_itmt_support()
* cannot be called from here. Queue up a work item
* to invoke it.
*/
schedule_work(&sched_itmt_work);
}
}
}
static int intel_pstate_get_cppc_guaranteed(int cpu)
{
struct cppc_perf_caps cppc_perf;
int ret;
ret = cppc_get_perf_caps(cpu, &cppc_perf);
if (ret)
return ret;
if (cppc_perf.guaranteed_perf)
return cppc_perf.guaranteed_perf;
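
	/* No guaranteed performance level advertised; fall back to nominal. */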
return cppc_perf.nominal_perf;
}
#else /* CONFIG_ACPI_CPPC_LIB */
static inline void intel_pstate_set_itmt_prio(int cpu)
{
}
#endif /* CONFIG_ACPI_CPPC_LIB */
static void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy)
{
struct cpudata *cpu;
int ret;
int i;
if (hwp_active) {
intel_pstate_set_itmt_prio(policy->cpu);
return;
}
if (!intel_pstate_get_ppc_enable_status())
return;
cpu = all_cpu_data[policy->cpu];
ret = acpi_processor_register_performance(&cpu->acpi_perf_data,
policy->cpu);
if (ret)
return;
/*
* Check if the control value in _PSS is for PERF_CTL MSR, which should
* guarantee that the states returned by it map to the states in our
* list directly.
*/
if (cpu->acpi_perf_data.control_register.space_id !=
ACPI_ADR_SPACE_FIXED_HARDWARE)
goto err;
/*
	 * If the _PSS table has only one entry, simply ignore it and continue
	 * as usual without taking it into account.
*/
if (cpu->acpi_perf_data.state_count < 2)
goto err;
pr_debug("CPU%u - ACPI _PSS perf data\n", policy->cpu);
for (i = 0; i < cpu->acpi_perf_data.state_count; i++) {
pr_debug(" %cP%d: %u MHz, %u mW, 0x%x\n",
(i == cpu->acpi_perf_data.state ? '*' : ' '), i,
(u32) cpu->acpi_perf_data.states[i].core_frequency,
(u32) cpu->acpi_perf_data.states[i].power,
(u32) cpu->acpi_perf_data.states[i].control);
}
cpu->valid_pss_table = true;
pr_debug("_PPC limits will be enforced\n");
return;
err:
cpu->valid_pss_table = false;
acpi_processor_unregister_performance(policy->cpu);
}
static void intel_pstate_exit_perf_limits(struct cpufreq_policy *policy)
{
struct cpudata *cpu;
cpu = all_cpu_data[policy->cpu];
if (!cpu->valid_pss_table)
return;
acpi_processor_unregister_performance(policy->cpu);
}
#else /* CONFIG_ACPI */
static inline void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy)
{
}
static inline void intel_pstate_exit_perf_limits(struct cpufreq_policy *policy)
{
}
static inline bool intel_pstate_acpi_pm_profile_server(void)
{
return false;
}
#endif /* CONFIG_ACPI */
#ifndef CONFIG_ACPI_CPPC_LIB
static inline int intel_pstate_get_cppc_guaranteed(int cpu)
{
return -ENOTSUPP;
}
#endif /* CONFIG_ACPI_CPPC_LIB */
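
/*
 * Rounding semantics of the freq-to-P-state conversion below, with a
 * hypothetical example: CPUFREQ_RELATION_H picks the highest P-state whose
 * frequency does not exceed @freq (integer division rounds down),
 * CPUFREQ_RELATION_C the closest P-state, and any other relation (notably
 * CPUFREQ_RELATION_L, the default) the lowest P-state running at or above
 * @freq (round up). For instance, with a scaling factor of 100000 kHz per
 * P-state, a 2350000 kHz request maps to P-state 23 for RELATION_H and to
 * P-state 24 for both RELATION_C and RELATION_L.
 */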
static int intel_pstate_freq_to_hwp_rel(struct cpudata *cpu, int freq,
unsigned int relation)
{
if (freq == cpu->pstate.turbo_freq)
return cpu->pstate.turbo_pstate;
if (freq == cpu->pstate.max_freq)
return cpu->pstate.max_pstate;
switch (relation) {
case CPUFREQ_RELATION_H:
return freq / cpu->pstate.scaling;
case CPUFREQ_RELATION_C:
return DIV_ROUND_CLOSEST(freq, cpu->pstate.scaling);
}
return DIV_ROUND_UP(freq, cpu->pstate.scaling);
}
static int intel_pstate_freq_to_hwp(struct cpudata *cpu, int freq)
{
return intel_pstate_freq_to_hwp_rel(cpu, freq, CPUFREQ_RELATION_L);
}
/**
* intel_pstate_hybrid_hwp_adjust - Calibrate HWP performance levels.
* @cpu: Target CPU.
*
* On hybrid processors, HWP may expose more performance levels than there are
* P-states accessible through the PERF_CTL interface. If that happens, the
* scaling factor between HWP performance levels and CPU frequency will be less
* than the scaling factor between P-state values and CPU frequency.
*
* In that case, adjust the CPU parameters used in computations accordingly.
*/
static void intel_pstate_hybrid_hwp_adjust(struct cpudata *cpu)
{
int perf_ctl_max_phys = cpu->pstate.max_pstate_physical;
int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
int perf_ctl_turbo = pstate_funcs.get_turbo(cpu->cpu);
int scaling = cpu->pstate.scaling;
int freq;
pr_debug("CPU%d: perf_ctl_max_phys = %d\n", cpu->cpu, perf_ctl_max_phys);
pr_debug("CPU%d: perf_ctl_turbo = %d\n", cpu->cpu, perf_ctl_turbo);
pr_debug("CPU%d: perf_ctl_scaling = %d\n", cpu->cpu, perf_ctl_scaling);
pr_debug("CPU%d: HWP_CAP guaranteed = %d\n", cpu->cpu, cpu->pstate.max_pstate);
pr_debug("CPU%d: HWP_CAP highest = %d\n", cpu->cpu, cpu->pstate.turbo_pstate);
pr_debug("CPU%d: HWP-to-frequency scaling factor: %d\n", cpu->cpu, scaling);
cpu->pstate.turbo_freq = rounddown(cpu->pstate.turbo_pstate * scaling,
perf_ctl_scaling);
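/*
 * Worked example (hypothetical hybrid P-core values): with an HWP
 * highest level of 57, scaling = 78741 and perf_ctl_scaling = 100000,
 * 57 * 78741 = 4488237 rounds down to turbo_freq = 4400000 kHz, so the
 * result stays on a PERF_CTL P-state frequency boundary.
 */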
cpu->pstate.max_freq = rounddown(cpu->pstate.max_pstate * scaling,
perf_ctl_scaling);
freq = perf_ctl_max_phys * perf_ctl_scaling;
cpu->pstate.max_pstate_physical = intel_pstate_freq_to_hwp(cpu, freq);
freq = cpu->pstate.min_pstate * perf_ctl_scaling;
cpu->pstate.min_freq = freq;
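/*
 * E.g. (hypothetical values): min_pstate = 8 with perf_ctl_scaling =
 * 100000 gives min_freq = 800000 kHz.
 */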
/*
* Cast the min P-state value retrieved via pstate_funcs.get_min() to
* the effective range of HWP performance levels.
*/
cpu->pstate.min_pstate = intel_pstate_freq_to_hwp(cpu, freq);
}
static inline void update_turbo_state(void)
{
u64 misc_en;
struct cpudata *cpu;
cpu = all_cpu_data[0];
rdmsrl(MSR_IA32_MISC_ENABLE, misc_en);
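/*
 * Illustration (assumed register contents): turbo is treated as
 * unavailable when the TURBO_DISABLE bit (bit 38) of
 * MSR_IA32_MISC_ENABLE is set by the firmware, or when
 * max_pstate == turbo_pstate, i.e. no turbo range is advertised.
 */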
global.turbo_disabled =
(misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE ||
cpu->pstate.max_pstate == cpu->pstate.turbo_pstate);
}
static int min_perf_pct_min(void)
{
struct cpudata *cpu = all_cpu_data[0];
int turbo_pstate = cpu->pstate.turbo_pstate;
return turbo_pstate ?
(cpu->pstate.min_pstate * 100 / turbo_pstate) : 0;
}
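/*
 * Worked example (hypothetical): min_pstate = 8 and turbo_pstate = 40
 * yield a floor of 8 * 100 / 40 = 20, i.e. min_perf_pct cannot be set
 * below 20 percent of the maximum turbo P-state.
 */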
static s16 intel_pstate_get_epb(struct cpudata *cpu_data)
{
u64 epb;
int ret;
if (!boot_cpu_has(X86_FEATURE_EPB))
return -ENXIO;
ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
if (ret)
return (s16)ret;
return (s16)(epb & 0x0f);
}
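/*
 * Example (assumed MSR contents): if MSR_IA32_ENERGY_PERF_BIAS reads
 * 0x0000000000000006, the EPB hint is the low nibble, so this returns 6.
 */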
static s16 intel_pstate_get_epp(struct cpudata *cpu_data, u64 hwp_req_data)
{
s16 epp;
if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
/*
 * When hwp_req_data is 0, the caller did not read MSR_HWP_REQUEST, so
 * read it here to obtain the EPP.
 */
if (!hwp_req_data) {
epp = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST,
&hwp_req_data);
if (epp)
return epp;
}
epp = (hwp_req_data >> 24) & 0xff;
} else {
/* When EPP is not present, HWP falls back to the EPB settings */
epp = intel_pstate_get_epb(cpu_data);
}
return epp;
}
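/*
 * Example (assumed value): hwp_req_data = 0x80002004 carries the EPP in
 * bits 31:24, so (0x80002004 >> 24) & 0xff = 0x80, i.e.
 * balance_performance.
 */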
static int intel_pstate_set_epb(int cpu, s16 pref)
{
u64 epb;
int ret;
if (!boot_cpu_has(X86_FEATURE_EPB))
return -ENXIO;
ret = rdmsrl_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
if (ret)
return ret;
epb = (epb & ~0x0f) | pref;
wrmsrl_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, epb);
return 0;
}
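/*
 * Example read-modify-write (hypothetical): with epb = 0x07 and
 * pref = 0x0f, (0x07 & ~0x0f) | 0x0f = 0x0f, so only the low 4 EPB
 * bits change and the rest of the MSR is preserved.
 */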
/*
* EPP/EPB display strings corresponding to EPP index in the
* energy_perf_strings[]
* index String
*-------------------------------------
* 0 default
* 1 performance
* 2 balance_performance
* 3 balance_power
* 4 power
*/
static const char * const energy_perf_strings[] = {
"default",
"performance",
"balance_performance",
"balance_power",
"power",
NULL
};
static const unsigned int epp_values[] = {
HWP_EPP_PERFORMANCE,
HWP_EPP_BALANCE_PERFORMANCE,
HWP_EPP_BALANCE_POWERSAVE,
HWP_EPP_POWERSAVE
};
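/*
 * Note: epp_values[] has no entry for "default" (index 0 in
 * energy_perf_strings[]), since "default" has no fixed EPP value, so
 * epp_values[i] pairs with energy_perf_strings[i + 1].
 */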
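
/*
 * Map the current EPP/EPB setting back to an index into
 * energy_perf_strings[]: 1..4 for the predefined preferences, 0 (with
 * *raw_epp filled in) for an HWP EPP value that matches none of them,
 * or a negative error code.
 */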
static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int *raw_epp)
{
	s16 epp;
	int index = -EINVAL;

	*raw_epp = 0;
	epp = intel_pstate_get_epp(cpu_data, 0);
	if (epp < 0)
		return epp;

	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
		if (epp == HWP_EPP_PERFORMANCE)
			return 1;
		if (epp == HWP_EPP_BALANCE_PERFORMANCE)
			return 2;
		if (epp == HWP_EPP_BALANCE_POWERSAVE)
			return 3;
		if (epp == HWP_EPP_POWERSAVE)
			return 4;
		*raw_epp = epp;
		return 0;
	} else if (boot_cpu_has(X86_FEATURE_EPB)) {
		/*
		 * Range:
		 *	0x00-0x03	:	Performance
		 *	0x04-0x07	:	Balance performance
		 *	0x08-0x0B	:	Balance power
		 *	0x0C-0x0F	:	Power
		 * The EPB is a 4 bit value, but our ranges restrict the
		 * value which can be set. Here only using top two bits
		 * effectively.
		 */
		index = (epp >> 2) + 1;
	}

	return index;
}
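
/*
 * Update the EPP field (bits 31:24) in the cached copy of the HWP Request
 * MSR, write the result to the register and cache the new EPP on success.
 */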
static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
{
	int ret;

	/*
	 * Use the cached HWP Request MSR value, because in the active mode the
	 * register itself may be updated by intel_pstate_hwp_boost_up() or
	 * intel_pstate_hwp_boost_down() at any time.
	 */
	u64 value = READ_ONCE(cpu->hwp_req_cached);

	value &= ~GENMASK_ULL(31, 24);
	value |= (u64)epp << 24;

	/*
	 * The only other updater of hwp_req_cached in the active mode,
	 * intel_pstate_hwp_set(), is called under the same lock as this
	 * function, so it cannot run in parallel with the update below.
	 */
	WRITE_ONCE(cpu->hwp_req_cached, value);
	ret = wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
	if (!ret)
		cpu->epp_cached = epp;

	return ret;
}
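
/*
 * Apply a preference selected via sysfs: pref_index 0 restores the
 * power-on default, a nonzero index selects one of the predefined
 * EPP/EPB levels, and use_raw writes the literal raw_epp value
 * (supported with HWP EPP only).
 */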
static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
					      int pref_index, bool use_raw,
					      u32 raw_epp)
{
	int epp = -EINVAL;
	int ret;

	if (!pref_index)
		epp = cpu_data->epp_default;

	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
		if (use_raw)
			epp = raw_epp;
		else if (epp == -EINVAL)
			epp = epp_values[pref_index - 1];

		/*
		 * To avoid confusion, refuse to set EPP to any values different
		 * from 0 (performance) if the current policy is "performance",
		 * because those values would be overridden.
		 */
		if (epp > 0 && cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE)
			return -EBUSY;

		ret = intel_pstate_set_epp(cpu_data, epp);
	} else {
		if (epp == -EINVAL)
			epp = (pref_index - 1) << 2;
		ret = intel_pstate_set_epb(cpu_data->cpu, epp);
	}

	return ret;
}
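
/*
 * Reading this attribute lists all supported preference strings, e.g.:
 *	default performance balance_performance balance_power power
 * The cpufreq_freq_attr_ro() line below generates the corresponding
 * read-only sysfs attribute.
 */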
static ssize_t show_energy_performance_available_preferences(
				struct cpufreq_policy *policy, char *buf)
{
	int i = 0;
	int ret = 0;

	while (energy_perf_strings[i] != NULL)
		ret += sprintf(&buf[ret], "%s ", energy_perf_strings[i++]);

	ret += sprintf(&buf[ret], "\n");

	return ret;
}

cpufreq_freq_attr_ro(energy_performance_available_preferences);

static struct cpufreq_driver intel_pstate;
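
/*
 * Accepts either one of the strings from energy_perf_strings[] or, when
 * HWP EPP is supported, a raw numeric EPP value (the EPP field is 8 bits
 * wide, so 0-255).
 */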
static ssize_t store_energy_performance_preference(
struct cpufreq_policy *policy, const char *buf, size_t count)
{
cpufreq: intel_pstate: Implement passive mode with HWP enabled Allow intel_pstate to work in the passive mode with HWP enabled and make it set the HWP minimum performance limit (HWP floor) to the P-state value given by the target frequency supplied by the cpufreq governor, so as to prevent the HWP algorithm and the CPU scheduler from working against each other, at least when the schedutil governor is in use, and update the intel_pstate documentation accordingly. Among other things, this allows utilization clamps to be taken into account, at least to a certain extent, when intel_pstate is in use and makes it more likely that sufficient capacity for deadline tasks will be provided. After this change, the resulting behavior of an HWP system with intel_pstate in the passive mode should be close to the behavior of the analogous non-HWP system with intel_pstate in the passive mode, except that the HWP algorithm is generally allowed to make the CPU run at a frequency above the floor P-state set by intel_pstate in the entire available range of P-states, while without HWP a CPU can run in a P-state above the requested one if the latter falls into the range of turbo P-states (referred to as the turbo range) or if the P-states of all CPUs in one package are coordinated with each other at the hardware level. [Note that in principle the HWP floor may not be taken into account by the processor if it falls into the turbo range, in which case the processor has a license to choose any P-state, either below or above the HWP floor, just like a non-HWP processor in the case when the target P-state falls into the turbo range.] With this change applied, intel_pstate in the passive mode assumes complete control over the HWP request MSR and concurrent changes of that MSR (eg. via the direct MSR access interface) are overridden by it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2020-08-06 20:03:55 +08:00
struct cpudata *cpu = all_cpu_data[policy->cpu];
char str_preference[21];
bool raw = false;
ssize_t ret;
u32 epp = 0;
ret = sscanf(buf, "%20s", str_preference);
if (ret != 1)
return -EINVAL;
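
/*
 * First try to match one of the predefined preference strings.  If that
 * fails, fall back to treating the input as a raw EPP value, which is only
 * accepted when the CPU supports HWP EPP and the value fits in 0-255.
 */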
ret = match_string(energy_perf_strings, -1, str_preference);
if (ret < 0) {
if (!boot_cpu_has(X86_FEATURE_HWP_EPP))
return ret;
ret = kstrtouint(buf, 10, &epp);
if (ret)
return ret;
if (epp > 255)
return -EINVAL;
raw = true;
}
/*
* This function runs with the policy R/W semaphore held, which
* guarantees that the driver pointer will not change while it is
* running.
*/
if (!intel_pstate_driver)
return -EAGAIN;
mutex_lock(&intel_pstate_limits_lock);
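/* In the active mode the driver applies the new EPP index directly. */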
if (intel_pstate_driver == &intel_pstate) {
ret = intel_pstate_set_energy_pref_index(cpu, ret, raw, epp);
} else {
/*
* In the passive mode the governor needs to be stopped on the
* target CPU before the EPP update and restarted after it,
* which is super-heavy-weight, so make sure it is worth doing
* upfront.
*/
if (!raw)
epp = ret ? epp_values[ret - 1] : cpu->epp_default;
if (cpu->epp_cached != epp) {
int err;
cpufreq_stop_governor(policy);
ret = intel_pstate_set_epp(cpu, epp);
err = cpufreq_start_governor(policy);
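/* Report a governor restart failure only if the EPP update succeeded. */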
if (!ret)
ret = err;
} else {
ret = 0;
}
}
mutex_unlock(&intel_pstate_limits_lock);
return ret ?: count;
}

static ssize_t show_energy_performance_preference(
struct cpufreq_policy *policy, char *buf)
{
struct cpudata *cpu_data = all_cpu_data[policy->cpu];
int preference, raw_epp;
preference = intel_pstate_get_energy_pref_index(cpu_data, &raw_epp);
if (preference < 0)
return preference;
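/*
 * A nonzero raw_epp means the current EPP does not correspond to any of
 * the named preferences, so show the numeric value instead.
 */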
if (raw_epp)
return sprintf(buf, "%d\n", raw_epp);
else
return sprintf(buf, "%s\n", energy_perf_strings[preference]);
}

cpufreq_freq_attr_rw(energy_performance_preference);
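
/*
 * Usage sketch for the attribute above (paths are illustrative; the policy
 * number depends on the system):
 *   $ cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference
 *   balance_performance
 *   $ echo performance > /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference
 *   $ echo 128 > /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference
 * The last write stores a raw EPP value and is only accepted on CPUs with
 * X86_FEATURE_HWP_EPP.
 */
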
static ssize_t show_base_frequency(struct cpufreq_policy *policy, char *buf)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
int ratio, freq;
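
/*
 * Prefer the guaranteed performance ratio reported through ACPI CPPC and
 * fall back to MSR_HWP_CAPABILITIES when CPPC is not available.
 */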
ratio = intel_pstate_get_cppc_guaranteed(policy->cpu);
if (ratio <= 0) {
u64 cap;
rdmsrl_on_cpu(policy->cpu, MSR_HWP_CAPABILITIES, &cap);
ratio = HWP_GUARANTEED_PERF(cap);
}
freq = ratio * cpu->pstate.scaling;
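/*
 * On hybrid processors the HWP-to-frequency scaling factor may differ from
 * the PERF_CTL one, so round the result down to a frequency that is
 * representable as a PERF_CTL P-state.
 */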
if (cpu->pstate.scaling != cpu->pstate.perf_ctl_scaling)
freq = rounddown(freq, cpu->pstate.perf_ctl_scaling);
return sprintf(buf, "%d\n", freq);
}

cpufreq_freq_attr_ro(base_frequency);
static struct freq_attr *hwp_cpufreq_attrs[] = {
&energy_performance_preference,
&energy_performance_available_preferences,
&base_frequency,
NULL,
};
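/*
 * Read MSR_HWP_CAPABILITIES for the given CPU, cache the raw value and
 * record the guaranteed and highest HWP performance levels as the
 * maximum non-turbo and turbo P-states, respectively.
 */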
static void __intel_pstate_get_hwp_cap(struct cpudata *cpu)
{
u64 cap;
rdmsrl_on_cpu(cpu->cpu, MSR_HWP_CAPABILITIES, &cap);
WRITE_ONCE(cpu->hwp_cap_cached, cap);
cpu->pstate.max_pstate = HWP_GUARANTEED_PERF(cap);
cpu->pstate.turbo_pstate = HWP_HIGHEST_PERF(cap);
}
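/*
 * Refresh the HWP capabilities and convert the resulting performance
 * levels to frequencies. On hybrid processors, where the HWP-to-frequency
 * scaling factor may differ from the PERF_CTL one, round the frequencies
 * down to multiples of the PERF_CTL scaling factor so they can still be
 * mapped back to valid PERF_CTL P-states.
 */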
static void intel_pstate_get_hwp_cap(struct cpudata *cpu)
{
int scaling = cpu->pstate.scaling;
__intel_pstate_get_hwp_cap(cpu);
cpu->pstate.max_freq = cpu->pstate.max_pstate * scaling;
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * scaling;
if (scaling != cpu->pstate.perf_ctl_scaling) {
int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
cpu->pstate.max_freq = rounddown(cpu->pstate.max_freq,
perf_ctl_scaling);
cpu->pstate.turbo_freq = rounddown(cpu->pstate.turbo_freq,
perf_ctl_scaling);
}
}
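/*
 * Program MSR_HWP_REQUEST for the given CPU from the cached min/max
 * performance ratios. For the "performance" policy the minimum is raised
 * to the maximum and, after saving the current EPP for later restoration,
 * the energy-performance preference is forced to 0 (maximum performance).
 */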
static void intel_pstate_hwp_set(unsigned int cpu)
{
struct cpudata *cpu_data = all_cpu_data[cpu];
int max, min;
u64 value;
s16 epp;
max = cpu_data->max_perf_ratio;
min = cpu_data->min_perf_ratio;
if (cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE)
min = max;
rdmsrl_on_cpu(cpu, MSR_HWP_REQUEST, &value);
value &= ~HWP_MIN_PERF(~0L);
value |= HWP_MIN_PERF(min);
value &= ~HWP_MAX_PERF(~0L);
value |= HWP_MAX_PERF(max);
if (cpu_data->epp_policy == cpu_data->policy)
goto skip_epp;
cpu_data->epp_policy = cpu_data->policy;
if (cpu_data->policy == CPUFREQ_POLICY_PERFORMANCE) {
epp = intel_pstate_get_epp(cpu_data, value);
cpu_data->epp_powersave = epp;
/* If the EPP read failed, don't try to write it back. */
if (epp < 0)
goto skip_epp;
epp = 0;
} else {
/* Skip setting EPP when the saved value is invalid. */
if (cpu_data->epp_powersave < 0)
goto skip_epp;
/*
 * No need to restore EPP when it is not zero, which means one of
 * the following:
 * - the policy has not changed,
 * - the user has changed it manually, or
 * - there was an error reading the EPB.
 */
epp = intel_pstate_get_epp(cpu_data, value);
if (epp)
goto skip_epp;
epp = cpu_data->epp_powersave;
}
if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
value &= ~GENMASK_ULL(31, 24);
value |= (u64)epp << 24;
} else {
intel_pstate_set_epb(cpu, epp);
}
skip_epp:
WRITE_ONCE(cpu_data->hwp_req_cached, value);
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
}
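/*
 * Prepare a CPU for going offline: drop any temporary "performance" EPP
 * override and the desired perf value from the cached request, then clamp
 * both the minimum and maximum HWP performance levels to the lowest
 * supported one and set EPP to "powersave", so the offline CPU consumes
 * as little energy as possible.
 */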
static void intel_pstate_hwp_offline(struct cpudata *cpu)
{
u64 value = READ_ONCE(cpu->hwp_req_cached);
int min_perf;
if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
/*
* In case the EPP has been set to "performance" by the
* active mode "performance" scaling algorithm, replace that
* temporary value with the cached EPP one.
*/
value &= ~GENMASK_ULL(31, 24);
value |= HWP_ENERGY_PERF_PREFERENCE(cpu->epp_cached);
/*
* However, make sure that EPP will be set to "performance" when
* the CPU is brought back online again and the "performance"
* scaling algorithm is still in effect.
*/
cpu->epp_policy = CPUFREQ_POLICY_UNKNOWN;
}
/*
* Clear the desired perf field in the cached HWP request value to
* prevent nonzero desired values from being leaked into the active
* mode.
*/
value &= ~HWP_DESIRED_PERF(~0L);
WRITE_ONCE(cpu->hwp_req_cached, value);
value &= ~GENMASK_ULL(31, 0);
min_perf = HWP_LOWEST_PERF(READ_ONCE(cpu->hwp_cap_cached));
/* Set hwp_max = hwp_min */
value |= HWP_MAX_PERF(min_perf);
value |= HWP_MIN_PERF(min_perf);
/* Set EPP to min */
if (boot_cpu_has(X86_FEATURE_HWP_EPP))
value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
}
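/*
 * Track whether the "Disable Energy Efficiency Optimization" bit in
 * MSR_IA32_POWER_CTL has been changed from the system default (zero means
 * it has not been touched), so the setting can be restored on resume.
 */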
#define POWER_CTL_EE_ENABLE 1
#define POWER_CTL_EE_DISABLE 2
static int power_ctl_ee_state;
static void set_power_ctl_ee_state(bool input)
{
u64 power_ctl;
mutex_lock(&intel_pstate_driver_lock);
rdmsrl(MSR_IA32_POWER_CTL, power_ctl);
if (input) {
power_ctl &= ~BIT(MSR_IA32_POWER_CTL_BIT_EE);
power_ctl_ee_state = POWER_CTL_EE_ENABLE;
} else {
power_ctl |= BIT(MSR_IA32_POWER_CTL_BIT_EE);
power_ctl_ee_state = POWER_CTL_EE_DISABLE;
}
wrmsrl(MSR_IA32_POWER_CTL, power_ctl);
mutex_unlock(&intel_pstate_driver_lock);
}
static void intel_pstate_hwp_enable(struct cpudata *cpudata);
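/*
 * Re-enable HWP and restore the last cached MSR_HWP_REQUEST value, for
 * example after resuming from a state in which the HWP configuration may
 * have been lost.
 */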
static void intel_pstate_hwp_reenable(struct cpudata *cpu)
{
intel_pstate_hwp_enable(cpu);
wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, READ_ONCE(cpu->hwp_req_cached));
}
static int intel_pstate_suspend(struct cpufreq_policy *policy)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
pr_debug("CPU %d suspending\n", cpu->cpu);
cpu->suspended = true;
return 0;
}
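/*
 * On resume, first restore the energy-efficiency bit if it has been
 * changed from the system default and then re-enable HWP for the
 * suspended CPU, which the "online" path intentionally skips.
 */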
static int intel_pstate_resume(struct cpufreq_policy *policy)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
pr_debug("CPU %d resuming\n", cpu->cpu);
/* Only restore the EE setting if it differs from the system default. */
if (power_ctl_ee_state == POWER_CTL_EE_ENABLE)
set_power_ctl_ee_state(true);
else if (power_ctl_ee_state == POWER_CTL_EE_DISABLE)
set_power_ctl_ee_state(false);
if (cpu->suspended && hwp_active) {
mutex_lock(&intel_pstate_limits_lock);
/* Re-enable HWP, because "online" has not done that. */
intel_pstate_hwp_reenable(cpu);
mutex_unlock(&intel_pstate_limits_lock);
}
cpu->suspended = false;
return 0;
}
static void intel_pstate_update_policies(void)
{
int cpu;
for_each_possible_cpu(cpu)
cpufreq_update_policy(cpu);
}
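/*
 * Update cpuinfo.max_freq for one CPU to reflect the current global turbo
 * state and refresh the frequency limits of its policy.
 */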
static void intel_pstate_update_max_freq(unsigned int cpu)
{
struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
struct cpudata *cpudata;
if (!policy)
return;
cpudata = all_cpu_data[cpu];
policy->cpuinfo.max_freq = global.turbo_disabled_mf ?
cpudata->pstate.max_freq : cpudata->pstate.turbo_freq;
refresh_frequency_limits(policy);
cpufreq_cpu_release(policy);
}
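/*
 * Driver callback invoked when the global limits may have changed: if
 * turbo availability has flipped, the maximum frequencies of all policies
 * need to be recomputed; otherwise only the policy of the given CPU is
 * updated.
 */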
static void intel_pstate_update_limits(unsigned int cpu)
{
mutex_lock(&intel_pstate_driver_lock);
update_turbo_state();
/*
* If turbo has been turned on or off globally, policy limits for
* all CPUs need to be updated to reflect that.
*/
if (global.turbo_disabled_mf != global.turbo_disabled) {
global.turbo_disabled_mf = global.turbo_disabled;
arch_set_max_freq_ratio(global.turbo_disabled);
for_each_possible_cpu(cpu)
intel_pstate_update_max_freq(cpu);
} else {
cpufreq_update_policy(cpu);
}
mutex_unlock(&intel_pstate_driver_lock);
}
/************************** sysfs begin ************************/
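/*
 * Generate a read-only show() handler printing a single field of the
 * global settings.
 */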
#define show_one(file_name, object) \
static ssize_t show_##file_name \
(struct kobject *kobj, struct kobj_attribute *attr, char *buf) \
{ \
return sprintf(buf, "%u\n", global.object); \
}
static ssize_t intel_pstate_show_status(char *buf);
static int intel_pstate_update_status(const char *buf, size_t size);
static ssize_t show_status(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
ssize_t ret;
mutex_lock(&intel_pstate_driver_lock);
ret = intel_pstate_show_status(buf);
mutex_unlock(&intel_pstate_driver_lock);
return ret;
}
static ssize_t store_status(struct kobject *a, struct kobj_attribute *b,
const char *buf, size_t count)
{
char *p = memchr(buf, '\n', count);
int ret;
mutex_lock(&intel_pstate_driver_lock);
ret = intel_pstate_update_status(buf, p ? p - buf : count);
mutex_unlock(&intel_pstate_driver_lock);
return ret < 0 ? ret : count;
}
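/*
 * Report the percentage of the supported P-state range that lies in the
 * turbo range. For example (with purely illustrative values), for
 * min_pstate 8, max_pstate 24 and turbo_pstate 30 there are 23 P-states
 * in total, 17 of them non-turbo, so roughly 27% of the range is turbo.
 */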
static ssize_t show_turbo_pct(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
struct cpudata *cpu;
int total, no_turbo, turbo_pct;
uint32_t turbo_fp;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
cpu = all_cpu_data[0];
total = cpu->pstate.turbo_pstate - cpu->pstate.min_pstate + 1;
no_turbo = cpu->pstate.max_pstate - cpu->pstate.min_pstate + 1;
turbo_fp = div_fp(no_turbo, total);
turbo_pct = 100 - fp_toint(mul_fp(turbo_fp, int_tofp(100)));
mutex_unlock(&intel_pstate_driver_lock);
return sprintf(buf, "%u\n", turbo_pct);
}
static ssize_t show_num_pstates(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
struct cpudata *cpu;
int total;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
cpu = all_cpu_data[0];
total = cpu->pstate.turbo_pstate - cpu->pstate.min_pstate + 1;
mutex_unlock(&intel_pstate_driver_lock);
return sprintf(buf, "%u\n", total);
}
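/*
 * Show the effective no_turbo setting: when turbo is unavailable (for
 * instance, disabled by the BIOS), report that state instead of the
 * user-requested value.
 */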
static ssize_t show_no_turbo(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
ssize_t ret;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
update_turbo_state();
if (global.turbo_disabled)
ret = sprintf(buf, "%u\n", global.turbo_disabled);
else
ret = sprintf(buf, "%u\n", global.no_turbo);
mutex_unlock(&intel_pstate_driver_lock);
return ret;
}
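/*
 * Update the global no_turbo setting. The write is rejected while turbo
 * is disabled by the platform, and disabling turbo may require squashing
 * the global minimum performance percentage into the non-turbo range.
 */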
static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
const char *buf, size_t count)
{
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
mutex_lock(&intel_pstate_limits_lock);
update_turbo_state();
if (global.turbo_disabled) {
pr_notice_once("Turbo disabled by BIOS or unavailable on processor\n");
mutex_unlock(&intel_pstate_limits_lock);
mutex_unlock(&intel_pstate_driver_lock);
return -EPERM;
}
global.no_turbo = clamp_t(int, input, 0, 1);
if (global.no_turbo) {
struct cpudata *cpu = all_cpu_data[0];
int pct = cpu->pstate.max_pstate * 100 / cpu->pstate.turbo_pstate;
/* Squash the global minimum into the permitted range. */
if (global.min_perf_pct > pct)
global.min_perf_pct = pct;
}
mutex_unlock(&intel_pstate_limits_lock);
intel_pstate_update_policies();
mutex_unlock(&intel_pstate_driver_lock);
return count;
}
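/*
 * The global min/max performance limits are enforced through per-policy
 * frequency QoS requests stored in policy->driver_data (this is how the
 * passive, intel_cpufreq, mode of the driver operates), so walk all
 * possible CPUs, refresh their HWP capabilities where applicable and
 * update the requests of the given type.
 */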
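
/*
 * Passive-mode helper: push the current global min or max limit into the
 * frequency QoS request of every policy.  policy->driver_data is expected
 * to point at a two-element array of requests (min first, then max), as
 * set up when the intel_cpufreq policy is initialized.
 */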
static void update_qos_request(enum freq_qos_req_type type)
{
struct freq_qos_request *req;
struct cpufreq_policy *policy;
int i;
for_each_possible_cpu(i) {
struct cpudata *cpu = all_cpu_data[i];
unsigned int freq, perf_pct;
policy = cpufreq_cpu_get(i);
if (!policy)
continue;
req = policy->driver_data;
cpufreq_cpu_put(policy);
if (!req)
continue;
if (hwp_active)
intel_pstate_get_hwp_cap(cpu);
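		/* driver_data holds two requests: index 0 is min, index 1 is max */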
if (type == FREQ_QOS_MIN) {
perf_pct = global.min_perf_pct;
} else {
req++;
perf_pct = global.max_perf_pct;
}
freq = DIV_ROUND_UP(cpu->pstate.turbo_freq * perf_pct, 100);
if (freq_qos_update_request(req, freq) < 0)
pr_warn("Failed to update freq constraint: CPU%d\n", i);
}
}
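
/*
 * Handler for writes to /sys/devices/system/cpu/intel_pstate/max_perf_pct,
 * e.g. "echo 75 > .../intel_pstate/max_perf_pct".  The value is clamped
 * between the current global minimum and 100 and then propagated to all
 * policies (active mode) or to their max-frequency QoS requests (passive
 * mode).
 */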
static ssize_t store_max_perf_pct(struct kobject *a, struct kobj_attribute *b,
const char *buf, size_t count)
{
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
mutex_lock(&intel_pstate_limits_lock);
global.max_perf_pct = clamp_t(int, input, global.min_perf_pct, 100);
mutex_unlock(&intel_pstate_limits_lock);
if (intel_pstate_driver == &intel_pstate)
intel_pstate_update_policies();
else
update_qos_request(FREQ_QOS_MAX);
mutex_unlock(&intel_pstate_driver_lock);
return count;
}
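
/*
 * Handler for writes to /sys/devices/system/cpu/intel_pstate/min_perf_pct.
 * The value is clamped between the hardware-dependent floor returned by
 * min_perf_pct_min() and the current global maximum before being applied.
 */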
static ssize_t store_min_perf_pct(struct kobject *a, struct kobj_attribute *b,
const char *buf, size_t count)
{
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
mutex_lock(&intel_pstate_driver_lock);
if (!intel_pstate_driver) {
mutex_unlock(&intel_pstate_driver_lock);
return -EAGAIN;
}
mutex_lock(&intel_pstate_limits_lock);
global.min_perf_pct = clamp_t(int, input,
min_perf_pct_min(), global.max_perf_pct);
mutex_unlock(&intel_pstate_limits_lock);
if (intel_pstate_driver == &intel_pstate)
intel_pstate_update_policies();
else
update_qos_request(FREQ_QOS_MIN);
mutex_unlock(&intel_pstate_driver_lock);
return count;
}
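
/* Expose the HWP dynamic boost knob as a 0/1 sysfs attribute. */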
static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
return sprintf(buf, "%u\n", hwp_boost);
}
static ssize_t store_hwp_dynamic_boost(struct kobject *a,
struct kobj_attribute *b,
const char *buf, size_t count)
{
unsigned int input;
int ret;
ret = kstrtouint(buf, 10, &input);
if (ret)
return ret;
mutex_lock(&intel_pstate_driver_lock);
hwp_boost = !!input;
intel_pstate_update_policies();
mutex_unlock(&intel_pstate_driver_lock);
return count;
}
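
/*
 * MSR_IA32_POWER_CTL carries a "disable energy efficiency optimization"
 * bit, so the sysfs value reported here is the logical inverse of the
 * bit: 1 means the optimization is enabled.
 */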
static ssize_t show_energy_efficiency(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
u64 power_ctl;
int enable;
rdmsrl(MSR_IA32_POWER_CTL, power_ctl);
enable = !!(power_ctl & BIT(MSR_IA32_POWER_CTL_BIT_EE));
return sprintf(buf, "%d\n", !enable);
}
static ssize_t store_energy_efficiency(struct kobject *a, struct kobj_attribute *b,
const char *buf, size_t count)
{
bool input;
int ret;
ret = kstrtobool(buf, &input);
if (ret)
return ret;
set_power_ctl_ee_state(input);
return count;
}
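
/* Read-side accessors and attribute definitions for the global sysfs knobs. */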
show_one(max_perf_pct, max_perf_pct);
show_one(min_perf_pct, min_perf_pct);
define_one_global_rw(status);
define_one_global_rw(no_turbo);
define_one_global_rw(max_perf_pct);
define_one_global_rw(min_perf_pct);
define_one_global_ro(turbo_pct);
define_one_global_ro(num_pstates);
define_one_global_rw(hwp_dynamic_boost);
define_one_global_rw(energy_efficiency);
static struct attribute *intel_pstate_attributes[] = {
&status.attr,
&no_turbo.attr,
NULL
};
static const struct attribute_group intel_pstate_attr_group = {
.attrs = intel_pstate_attributes,
};
static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
static struct kobject *intel_pstate_kobject;
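
/*
 * Create the global /sys/devices/system/cpu/intel_pstate/ directory and
 * populate it.  turbo_pct and num_pstates are not meaningful on hybrid
 * CPUs, and the global limit attributes are omitted when per-CPU limits
 * are in use.
 */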
static void __init intel_pstate_sysfs_expose_params(void)
{
int rc;
intel_pstate_kobject = kobject_create_and_add("intel_pstate",
&cpu_subsys.dev_root->kobj);
if (WARN_ON(!intel_pstate_kobject))
return;
rc = sysfs_create_group(intel_pstate_kobject, &intel_pstate_attr_group);
if (WARN_ON(rc))
return;
if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
rc = sysfs_create_file(intel_pstate_kobject, &turbo_pct.attr);
WARN_ON(rc);
rc = sysfs_create_file(intel_pstate_kobject, &num_pstates.attr);
WARN_ON(rc);
}
/*
 * If per-CPU limits are enforced, there are no global limits, so
 * return without creating the max/min_perf_pct attributes.
 */
if (per_cpu_limits)
return;
rc = sysfs_create_file(intel_pstate_kobject, &max_perf_pct.attr);
WARN_ON(rc);
rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
WARN_ON(rc);
if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
WARN_ON(rc);
}
}
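
/* Tear down everything created by intel_pstate_sysfs_expose_params(). */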
static void __init intel_pstate_sysfs_remove(void)
{
if (!intel_pstate_kobject)
return;
sysfs_remove_group(intel_pstate_kobject, &intel_pstate_attr_group);
if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
sysfs_remove_file(intel_pstate_kobject, &num_pstates.attr);
sysfs_remove_file(intel_pstate_kobject, &turbo_pct.attr);
}
if (!per_cpu_limits) {
sysfs_remove_file(intel_pstate_kobject, &max_perf_pct.attr);
sysfs_remove_file(intel_pstate_kobject, &min_perf_pct.attr);
if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids))
sysfs_remove_file(intel_pstate_kobject, &energy_efficiency.attr);
}
kobject_put(intel_pstate_kobject);
}
static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
{
int rc;
if (!hwp_active)
return;
rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
WARN_ON_ONCE(rc);
}
static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
{
if (!hwp_active)
return;
sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
}
/************************** sysfs end ************************/
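
/*
 * Enable HWP on the given CPU: mask the HWP notification interrupt,
 * write 1 to MSR_PM_ENABLE to turn HWP on, and cache the firmware
 * default EPP the first time through.
 */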
static void intel_pstate_hwp_enable(struct cpudata *cpudata)
{
/* First disable the HWP notification interrupt, as we don't process it */
if (boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_INTERRUPT, 0x00);
wrmsrl_on_cpu(cpudata->cpu, MSR_PM_ENABLE, 0x1);
if (cpudata->epp_default == -EINVAL)
cpudata->epp_default = intel_pstate_get_epp(cpudata, 0);
}
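
/*
 * On these Atom cores the min and max (guaranteed) ratios live in
 * MSR_ATOM_CORE_RATIOS at bits [14:8] and [22:16] respectively, and
 * the turbo ratio in bits [6:0] of MSR_ATOM_CORE_TURBO_RATIOS.
 */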
static int atom_get_min_pstate(int not_used)
{
u64 value;
rdmsrl(MSR_ATOM_CORE_RATIOS, value);
return (value >> 8) & 0x7F;
}
static int atom_get_max_pstate(int not_used)
{
u64 value;
rdmsrl(MSR_ATOM_CORE_RATIOS, value);
return (value >> 16) & 0x7F;
}
static int atom_get_turbo_pstate(int not_used)
{
u64 value;
rdmsrl(MSR_ATOM_CORE_TURBO_RATIOS, value);
return value & 0x7F;
}
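
/*
 * Compose the PERF_CTL value for an Atom P-state: the ratio goes in
 * bits [15:8], turbo is disengaged via bit 32 when it has been disabled
 * by the user but not by the hardware, and the voltage ID is linearly
 * interpolated (in fixed point) between vid.min and vid.max, with the
 * dedicated turbo VID used above the guaranteed maximum P-state.
 */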
static u64 atom_get_val(struct cpudata *cpudata, int pstate)
{
u64 val;
int32_t vid_fp;
u32 vid;
val = (u64)pstate << 8;
if (global.no_turbo && !global.turbo_disabled)
val |= (u64)1 << 32;
vid_fp = cpudata->vid.min + mul_fp(
int_tofp(pstate - cpudata->pstate.min_pstate),
cpudata->vid.ratio);
vid_fp = clamp_t(int32_t, vid_fp, cpudata->vid.min, cpudata->vid.max);
vid = ceiling_fp(vid_fp);
if (pstate > cpudata->pstate.max_pstate)
vid = cpudata->vid.turbo;
return val | vid;
}
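/*
 * The scaling factor maps P-state ratio values to frequency in kHz.
 * On Silvermont it depends on the bus clock, which is encoded in the
 * low 3 bits of MSR_FSB_FREQ.
 */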
static int silvermont_get_scaling(void)
{
u64 value;
int i;
/* Defined in Table 35-6 from SDM (Sept 2015) */
static int silvermont_freq_table[] = {
83300, 100000, 133300, 116700, 80000};
rdmsrl(MSR_FSB_FREQ, value);
i = value & 0x7;
WARN_ON(i > 4);
return silvermont_freq_table[i];
}
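/* Airmont encodes the bus clock in the low 4 bits of MSR_FSB_FREQ. */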
static int airmont_get_scaling(void)
{
u64 value;
int i;
/* Defined in Table 35-10 from SDM (Sept 2015) */
static int airmont_freq_table[] = {
83300, 100000, 133300, 116700, 80000,
93300, 90000, 88900, 87500};
rdmsrl(MSR_FSB_FREQ, value);
i = value & 0xF;
WARN_ON(i > 8);
return airmont_freq_table[i];
}
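/*
 * Read the VID range from MSR_ATOM_CORE_VIDS and precompute the
 * fixed-point VID-per-P-state slope (vid.ratio) that atom_get_val()
 * uses for interpolation, along with the separate turbo VID.
 */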
static void atom_get_vid(struct cpudata *cpudata)
{
u64 value;
rdmsrl(MSR_ATOM_CORE_VIDS, value);
cpudata->vid.min = int_tofp((value >> 8) & 0x7f);
cpudata->vid.max = int_tofp((value >> 16) & 0x7f);
cpudata->vid.ratio = div_fp(
cpudata->vid.max - cpudata->vid.min,
int_tofp(cpudata->pstate.max_pstate -
cpudata->pstate.min_pstate));
rdmsrl(MSR_ATOM_CORE_TURBO_VIDS, value);
cpudata->vid.turbo = value & 0x7f;
}
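/*
 * MSR_PLATFORM_INFO bits 47:40 hold the maximum efficiency ratio
 * (the minimum P-state) and bits 15:8 the maximum non-turbo ratio.
 */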
static int core_get_min_pstate(int cpu)
{
u64 value;
rdmsrl_on_cpu(cpu, MSR_PLATFORM_INFO, &value);
return (value >> 40) & 0xFF;
}
static int core_get_max_pstate_physical(int cpu)
{
u64 value;
rdmsrl_on_cpu(cpu, MSR_PLATFORM_INFO, &value);
return (value >> 8) & 0xFF;
}
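/*
 * Bits 34:33 of MSR_PLATFORM_INFO indicate additional configurable
 * TDP levels.  If any are present, MSR_CONFIG_TDP_CONTROL selects the
 * active level and the matching MSR_CONFIG_TDP_* register supplies
 * the corresponding base ratio.
 */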
static int core_get_tdp_ratio(int cpu, u64 plat_info)
{
/* Check how many TDP levels present */
if (plat_info & 0x600000000) {
u64 tdp_ctrl;
u64 tdp_ratio;
int tdp_msr;
int err;
/* Get the TDP level (0, 1, 2) to get ratios */
err = rdmsrl_safe_on_cpu(cpu, MSR_CONFIG_TDP_CONTROL, &tdp_ctrl);
if (err)
return err;
/* TDP MSRs are contiguous starting at 0x648 */
tdp_msr = MSR_CONFIG_TDP_NOMINAL + (tdp_ctrl & 0x03);
err = rdmsrl_safe_on_cpu(cpu, tdp_msr, &tdp_ratio);
if (err)
return err;
/* For level 1 and 2, bits[23:16] contain the ratio */
if (tdp_ctrl & 0x03)
tdp_ratio >>= 16;
tdp_ratio &= 0xff; /* ratios are only 8 bits long */
pr_debug("tdp_ratio %x\n", (int)tdp_ratio);
return (int)tdp_ratio;
}
return -ENXIO;
}
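/*
 * Start from the maximum non-turbo ratio in MSR_PLATFORM_INFO and
 * prefer the configurable-TDP ratio when one is programmed.  With HWP
 * the turbo activation ratio is not used, so the TDP ratio is final;
 * otherwise it is cross-checked against MSR_TURBO_ACTIVATION_RATIO.
 */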
static int core_get_max_pstate(int cpu)
{
u64 tar;
u64 plat_info;
int max_pstate;
int tdp_ratio;
int err;
rdmsrl_on_cpu(cpu, MSR_PLATFORM_INFO, &plat_info);
max_pstate = (plat_info >> 8) & 0xFF;
tdp_ratio = core_get_tdp_ratio(cpu, plat_info);
if (tdp_ratio <= 0)
return max_pstate;
if (hwp_active) {
/* Turbo activation ratio is not used on HWP platforms */
return tdp_ratio;
}
err = rdmsrl_safe_on_cpu(cpu, MSR_TURBO_ACTIVATION_RATIO, &tar);
if (!err) {
int tar_levels;
/* Do some sanity checking for safety */
tar_levels = tar & 0xff;
if (tdp_ratio - 1 == tar_levels) {
max_pstate = tar_levels;
pr_debug("max_pstate=TAC %x\n", max_pstate);
}
}
return max_pstate;
}
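/*
 * Bits 7:0 of MSR_TURBO_RATIO_LIMIT hold the maximum 1-core turbo
 * ratio; never report less than the maximum non-turbo P-state.
 */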
static int core_get_turbo_pstate(int cpu)
{
u64 value;
int nont, ret;
rdmsrl_on_cpu(cpu, MSR_TURBO_RATIO_LIMIT, &value);
nont = core_get_max_pstate(cpu);
ret = (value) & 255;
if (ret <= nont)
ret = nont;
return ret;
}
static inline int core_get_scaling(void)
{
return 100000;
}
static u64 core_get_val(struct cpudata *cpudata, int pstate)
{
u64 val;
val = (u64)pstate << 8;
if (global.no_turbo && !global.turbo_disabled)
val |= (u64)1 << 32;
return val;
}
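/*
 * Knights Landing needs an extra shift of 10 applied when forming the
 * APERF/MPERF performance ratio; the sampling code picks it up via
 * pstate_funcs.get_aperf_mperf_shift().
 */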
static int knl_get_aperf_mperf_shift(void)
{
return 10;
}
static int knl_get_turbo_pstate(int cpu)
{
u64 value;
int nont, ret;
rdmsrl_on_cpu(cpu, MSR_TURBO_RATIO_LIMIT, &value);
nont = core_get_max_pstate(cpu);
ret = (((value) >> 8) & 0xFF);
if (ret <= nont)
ret = nont;
return ret;
}
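/*
 * hybrid_get_type() must run on the CPU being queried, because
 * get_this_hybrid_cpu_type() reports the type of the executing CPU,
 * hence the smp_call_function_single() below.  Type 0x40 identifies a
 * P-core; on all hybrid platforms so far its HWP-perf-to-frequency
 * scaling factor is 78741 (approximately 100000/1.27), while E-cores
 * use the common "core" factor of 100000.
 */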
static void hybrid_get_type(void *data)
{
u8 *cpu_type = data;
*cpu_type = get_this_hybrid_cpu_type();
}
static int hybrid_get_cpu_scaling(int cpu)
{
u8 cpu_type = 0;
smp_call_function_single(cpu, hybrid_get_type, &cpu_type, 1);
/* P-cores have a smaller perf level-to-frequency scaling factor. */
if (cpu_type == 0x40)
return 78741;
return core_get_scaling();
}
static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
{
trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
cpu->pstate.current_pstate = pstate;
/*
* Generally, there is no guarantee that this code will always run on
* the CPU being updated, so force the register update to run on the
* right CPU.
*/
wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
pstate_funcs.get_val(cpu, pstate));
}
static void intel_pstate_set_min_pstate(struct cpudata *cpu)
{
intel_pstate_set_pstate(cpu, cpu->pstate.min_pstate);
}
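/* Go to the highest P-state the current limits allow. */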
static void intel_pstate_max_within_limits(struct cpudata *cpu)
{
int pstate = max(cpu->pstate.min_pstate, cpu->max_perf_ratio);
update_turbo_state();
intel_pstate_set_pstate(cpu, pstate);
}
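/*
 * Populate the P-state bounds and scaling factor for one CPU.  On
 * hybrid HWP systems the CPU-specific scaling factor may differ from
 * the PERF_CTL one, in which case the frequencies are adjusted via
 * intel_pstate_hybrid_hwp_adjust().  The CPU is finally placed at its
 * minimum P-state.
 */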
static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
{
int perf_ctl_max_phys = pstate_funcs.get_max_physical(cpu->cpu);
int perf_ctl_scaling = pstate_funcs.get_scaling();
cpu->pstate.min_pstate = pstate_funcs.get_min(cpu->cpu);
cpu->pstate.max_pstate_physical = perf_ctl_max_phys;
cpu->pstate.perf_ctl_scaling = perf_ctl_scaling;
if (hwp_active && !hwp_mode_bdw) {
__intel_pstate_get_hwp_cap(cpu);
if (pstate_funcs.get_cpu_scaling) {
cpu->pstate.scaling = pstate_funcs.get_cpu_scaling(cpu->cpu);
if (cpu->pstate.scaling != perf_ctl_scaling)
intel_pstate_hybrid_hwp_adjust(cpu);
} else {
cpu->pstate.scaling = perf_ctl_scaling;
}
} else {
cpu->pstate.scaling = perf_ctl_scaling;
cpu->pstate.max_pstate = pstate_funcs.get_max(cpu->cpu);
cpu->pstate.turbo_pstate = pstate_funcs.get_turbo(cpu->cpu);
}
if (cpu->pstate.scaling == perf_ctl_scaling) {
cpu->pstate.min_freq = cpu->pstate.min_pstate * perf_ctl_scaling;
cpu->pstate.max_freq = cpu->pstate.max_pstate * perf_ctl_scaling;
cpu->pstate.turbo_freq = cpu->pstate.turbo_pstate * perf_ctl_scaling;
}
if (pstate_funcs.get_aperf_mperf_shift)
cpu->aperf_mperf_shift = pstate_funcs.get_aperf_mperf_shift();
if (pstate_funcs.get_vid)
pstate_funcs.get_vid(cpu);
intel_pstate_set_min_pstate(cpu);
}
/*
 * A long hold time keeps the high perf limits in place for a long
 * time, which negatively impacts perf/watt for some workloads,
 * like specpower. 3ms is based on experiments on some workloads.
 */
static int hwp_boost_hold_time_ns = 3 * NSEC_PER_MSEC;
static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
{
u64 hwp_req = READ_ONCE(cpu->hwp_req_cached);
u64 hwp_cap = READ_ONCE(cpu->hwp_cap_cached);
u32 max_limit = (hwp_req & 0xff00) >> 8;
u32 min_limit = (hwp_req & 0xff);
u32 boost_level1;
/*
* Cases to consider (User changes via sysfs or boot time):
* If, P0 (Turbo max) = P1 (Guaranteed max) = min:
* No boost, return.
* If, P0 (Turbo max) > P1 (Guaranteed max) = min:
* Should result in one level boost only for P0.
* If, P0 (Turbo max) = P1 (Guaranteed max) > min:
* Should result in two level boost:
* (min + p1)/2 and P1.
* If, P0 (Turbo max) > P1 (Guaranteed max) > min:
* Should result in three level boost:
* (min + p1)/2, P1 and P0.
*/
/* If max and min are equal or already at max, nothing to boost */
if (max_limit == min_limit || cpu->hwp_boost_min >= max_limit)
return;
if (!cpu->hwp_boost_min)
cpu->hwp_boost_min = min_limit;
/* level at the halfway mark between min and guaranteed */
boost_level1 = (HWP_GUARANTEED_PERF(hwp_cap) + min_limit) >> 1;
if (cpu->hwp_boost_min < boost_level1)
cpu->hwp_boost_min = boost_level1;
else if (cpu->hwp_boost_min < HWP_GUARANTEED_PERF(hwp_cap))
cpu->hwp_boost_min = HWP_GUARANTEED_PERF(hwp_cap);
else if (cpu->hwp_boost_min == HWP_GUARANTEED_PERF(hwp_cap) &&
max_limit != HWP_GUARANTEED_PERF(hwp_cap))
cpu->hwp_boost_min = max_limit;
else
return;
hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | cpu->hwp_boost_min;
wrmsrl(MSR_HWP_REQUEST, hwp_req);
cpu->last_update = cpu->sample.time;
}
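/*
 * Worked example (hypothetical values, for illustration only): with
 * min_limit = 10, HWP_GUARANTEED_PERF(hwp_cap) = 20 and max_limit = 25,
 * successive calls to intel_pstate_hwp_boost_up() step hwp_boost_min
 * through (10 + 20) / 2 = 15, then 20, then 25; once hwp_boost_min
 * reaches max_limit, further calls return early without an MSR write.
 */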
static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
{
if (cpu->hwp_boost_min) {
bool expired;
/* Check if we have been idle for the hold time needed to boost down */
expired = time_after64(cpu->sample.time, cpu->last_update +
hwp_boost_hold_time_ns);
if (expired) {
wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
cpu->hwp_boost_min = 0;
}
}
cpu->last_update = cpu->sample.time;
}
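/*
 * Illustration of the timing above: last_update is refreshed on every
 * callback, so "expired" only becomes true when more than
 * hwp_boost_hold_time_ns (3 ms) has passed since the previous one,
 * i.e. the CPU was idle for the hold time; only then is
 * MSR_HWP_REQUEST restored from hwp_req_cached (the value set at HWP
 * init time or last written via sysfs).
 */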
static inline void intel_pstate_update_util_hwp_local(struct cpudata *cpu,
u64 time)
{
cpu->sample.time = time;
if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
bool do_io = false;
cpu->sched_flags = 0;
/*
 * Set iowait_boost flag and update time. Since the IO WAIT flag
 * is set all the time, we can't just conclude that there is
 * some IO bound activity scheduled on this CPU from just one
 * occurrence. If we receive at least two in two consecutive
 * ticks, then we treat the CPU as a boost candidate.
 */
if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC))
do_io = true;
cpu->last_io_update = time;
if (do_io)
intel_pstate_hwp_boost_up(cpu);
} else {
intel_pstate_hwp_boost_down(cpu);
}
}
static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
u64 time, unsigned int flags)
{
struct cpudata *cpu = container_of(data, struct cpudata, update_util);
cpu->sched_flags |= flags;
if (smp_processor_id() == cpu->cpu)
intel_pstate_update_util_hwp_local(cpu, time);
}
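/*
 * Note: the utilization update hook may be invoked for a remote CPU,
 * but MSR_HWP_REQUEST has to be written on the CPU it belongs to, so
 * the boost work above is only carried out when the callback runs on
 * the target CPU itself.
 */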
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
sample->core_avg_perf = div_ext_fp(sample->aperf, sample->mperf);
}
static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
{
u64 aperf, mperf;
unsigned long flags;
u64 tsc;
local_irq_save(flags);
rdmsrl(MSR_IA32_APERF, aperf);
rdmsrl(MSR_IA32_MPERF, mperf);
tsc = rdtsc();
if (cpu->prev_mperf == mperf || cpu->prev_tsc == tsc) {
local_irq_restore(flags);
return false;
}
local_irq_restore(flags);
cpu->last_sample_time = cpu->sample.time;
cpu->sample.time = time;
cpu->sample.aperf = aperf;
cpu->sample.mperf = mperf;
cpu->sample.tsc = tsc;
cpu->sample.aperf -= cpu->prev_aperf;
cpu->sample.mperf -= cpu->prev_mperf;
cpu->sample.tsc -= cpu->prev_tsc;
cpu->prev_aperf = aperf;
cpu->prev_mperf = mperf;
cpu->prev_tsc = tsc;
/*
* First time this function is invoked in a given cycle, all of the
* previous sample data fields are equal to zero or stale and they must
* be populated with meaningful numbers for things to work, so assume
* that sample.time will always be reset before setting the utilization
* update hook and make the caller skip the sample then.
*/
if (cpu->last_sample_time) {
intel_pstate_calc_avg_perf(cpu);
return true;
}
return false;
}
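/*
 * The APERF/MPERF/TSC reads above are done with interrupts disabled so
 * that the three counters are sampled close together; the deltas
 * relative to the previous sample are what feed
 * intel_pstate_calc_avg_perf() and get_target_pstate() below.
 */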
static inline int32_t get_avg_frequency(struct cpudata *cpu)
{
return mul_ext_fp(cpu->sample.core_avg_perf, cpu_khz);
}
static inline int32_t get_avg_pstate(struct cpudata *cpu)
{
return mul_ext_fp(cpu->pstate.max_pstate_physical,
cpu->sample.core_avg_perf);
}
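/*
 * Example (hypothetical numbers): with max_pstate_physical = 30 and
 * core_avg_perf = 0.5 (APERF advanced at half the MPERF rate), the CPU
 * averaged roughly P-state 15 over the last sampling interval.
 */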
static inline int32_t get_target_pstate(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
int32_t busy_frac;
int target, avg_pstate;
busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
sample->tsc);
if (busy_frac < cpu->iowait_boost)
busy_frac = cpu->iowait_boost;
sample->busy_scaled = busy_frac * 100;
target = global.no_turbo || global.turbo_disabled ?
cpu->pstate.max_pstate : cpu->pstate.turbo_pstate;
target += target >> 2;
target = mul_fp(target, busy_frac);
if (target < cpu->pstate.min_pstate)
target = cpu->pstate.min_pstate;
/*
* If the average P-state during the previous cycle was higher than the
* current target, add 50% of the difference to the target to reduce
* possible performance oscillations and offset possible performance
* loss related to moving the workload from one CPU to another within
* a package/module.
*/
avg_pstate = get_avg_pstate(cpu);
if (avg_pstate > target)
target += (avg_pstate - target) >> 1;
return target;
}
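/*
 * Worked example (hypothetical values): with turbo_pstate = 32 and
 * busy_frac = 0.5, target = (32 + 32/4) * 0.5 = 20.  If the average
 * P-state over the previous cycle was 26, the final target becomes
 * 20 + (26 - 20) / 2 = 23.
 */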
static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
{
int min_pstate = max(cpu->pstate.min_pstate, cpu->min_perf_ratio);
int max_pstate = max(min_pstate, cpu->max_perf_ratio);
return clamp_t(int, pstate, min_pstate, max_pstate);
}
static void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)
{
if (pstate == cpu->pstate.current_pstate)
return;
cpu->pstate.current_pstate = pstate;
wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
}
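/*
 * A plain wrmsrl() is sufficient here: this path only runs from the
 * utilization update callback, which executes on the target CPU with
 * interrupts disabled, where wrmsrl_on_cpu() must not be used (it
 * would trigger the WARN_ON_ONCE() in smp_call_function_single()).
 */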
static void intel_pstate_adjust_pstate(struct cpudata *cpu)
{
int from = cpu->pstate.current_pstate;
struct sample *sample;
int target_pstate;
update_turbo_state();
target_pstate = get_target_pstate(cpu);
target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
intel_pstate_update_pstate(cpu, target_pstate);
sample = &cpu->sample;
trace_pstate_sample(mul_ext_fp(100, sample->core_avg_perf),
fp_toint(sample->busy_scaled),
from,
cpu->pstate.current_pstate,
sample->mperf,
sample->aperf,
sample->tsc,
get_avg_frequency(cpu),
fp_toint(cpu->iowait_boost * 100));
}
static void intel_pstate_update_util(struct update_util_data *data, u64 time,
unsigned int flags)
{
struct cpudata *cpu = container_of(data, struct cpudata, update_util);
u64 delta_ns;
/* Don't allow remote callbacks */
if (smp_processor_id() != cpu->cpu)
return;
delta_ns = time - cpu->last_update;
if (flags & SCHED_CPUFREQ_IOWAIT) {
/* Start over if the CPU may have been idle. */
if (delta_ns > TICK_NSEC) {
cpu->iowait_boost = ONE_EIGHTH_FP;
} else if (cpu->iowait_boost >= ONE_EIGHTH_FP) {
cpu->iowait_boost <<= 1;
if (cpu->iowait_boost > int_tofp(1))
cpu->iowait_boost = int_tofp(1);
} else {
cpu->iowait_boost = ONE_EIGHTH_FP;
}
} else if (cpu->iowait_boost) {
/* Clear iowait_boost if the CPU may have been idle. */
if (delta_ns > TICK_NSEC)
cpu->iowait_boost = 0;
else
cpu->iowait_boost >>= 1;
}
cpu->last_update = time;
delta_ns = time - cpu->sample.time;
if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
return;
if (intel_pstate_sample(cpu, time))
intel_pstate_adjust_pstate(cpu);
}
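/*
 * I/O-wait boost ramp, per the code above: the boost starts at 1/8 of
 * full scale and doubles on each I/O-wait wakeup that arrives within a
 * tick of the previous update, saturating at 1.0; it is halved on
 * every non-I/O-wait callback and cleared after a tick of idleness.
 */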
static struct pstate_funcs core_funcs = {
.get_max = core_get_max_pstate,
.get_max_physical = core_get_max_pstate_physical,
.get_min = core_get_min_pstate,
.get_turbo = core_get_turbo_pstate,
.get_scaling = core_get_scaling,
.get_val = core_get_val,
};
static const struct pstate_funcs silvermont_funcs = {
.get_max = atom_get_max_pstate,
.get_max_physical = atom_get_max_pstate,
.get_min = atom_get_min_pstate,
.get_turbo = atom_get_turbo_pstate,
.get_val = atom_get_val,
.get_scaling = silvermont_get_scaling,
.get_vid = atom_get_vid,
};
static const struct pstate_funcs airmont_funcs = {
.get_max = atom_get_max_pstate,
.get_max_physical = atom_get_max_pstate,
.get_min = atom_get_min_pstate,
.get_turbo = atom_get_turbo_pstate,
.get_val = atom_get_val,
.get_scaling = airmont_get_scaling,
.get_vid = atom_get_vid,
};
static const struct pstate_funcs knl_funcs = {
.get_max = core_get_max_pstate,
.get_max_physical = core_get_max_pstate_physical,
.get_min = core_get_min_pstate,
.get_turbo = knl_get_turbo_pstate,
.get_aperf_mperf_shift = knl_get_aperf_mperf_shift,
.get_scaling = core_get_scaling,
.get_val = core_get_val,
};
#define X86_MATCH(model, policy) \
X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_##model, \
X86_FEATURE_APERFMPERF, &policy)
static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
X86_MATCH(SANDYBRIDGE, core_funcs),
X86_MATCH(SANDYBRIDGE_X, core_funcs),
X86_MATCH(ATOM_SILVERMONT, silvermont_funcs),
X86_MATCH(IVYBRIDGE, core_funcs),
X86_MATCH(HASWELL, core_funcs),
X86_MATCH(BROADWELL, core_funcs),
X86_MATCH(IVYBRIDGE_X, core_funcs),
X86_MATCH(HASWELL_X, core_funcs),
X86_MATCH(HASWELL_L, core_funcs),
X86_MATCH(HASWELL_G, core_funcs),
X86_MATCH(BROADWELL_G, core_funcs),
X86_MATCH(ATOM_AIRMONT, airmont_funcs),
X86_MATCH(SKYLAKE_L, core_funcs),
X86_MATCH(BROADWELL_X, core_funcs),
X86_MATCH(SKYLAKE, core_funcs),
X86_MATCH(BROADWELL_D, core_funcs),
X86_MATCH(XEON_PHI_KNL, knl_funcs),
X86_MATCH(XEON_PHI_KNM, knl_funcs),
X86_MATCH(ATOM_GOLDMONT, core_funcs),
X86_MATCH(ATOM_GOLDMONT_PLUS, core_funcs),
X86_MATCH(SKYLAKE_X, core_funcs),
X86_MATCH(COMETLAKE, core_funcs),
X86_MATCH(ICELAKE_X, core_funcs),
X86_MATCH(TIGERLAKE, core_funcs),
{}
};
MODULE_DEVICE_TABLE(x86cpu, intel_pstate_cpu_ids);
static const struct x86_cpu_id intel_pstate_cpu_oob_ids[] __initconst = {
X86_MATCH(BROADWELL_D, core_funcs),
X86_MATCH(BROADWELL_X, core_funcs),
X86_MATCH(SKYLAKE_X, core_funcs),
X86_MATCH(ICELAKE_X, core_funcs),
{}
};
static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
X86_MATCH(KABYLAKE, core_funcs),
{}
};
static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
X86_MATCH(SKYLAKE_X, core_funcs),
X86_MATCH(SKYLAKE, core_funcs),
{}
};
static int intel_pstate_init_cpu(unsigned int cpunum)
{
struct cpudata *cpu;
cpu = all_cpu_data[cpunum];
if (!cpu) {
cpu = kzalloc(sizeof(*cpu), GFP_KERNEL);
if (!cpu)
return -ENOMEM;
all_cpu_data[cpunum] = cpu;
cpu->cpu = cpunum;
cpu->epp_default = -EINVAL;
if (hwp_active) {
const struct x86_cpu_id *id;
intel_pstate_hwp_enable(cpu);
id = x86_match_cpu(intel_pstate_hwp_boost_ids);
if (id && intel_pstate_acpi_pm_profile_server())
hwp_boost = true;
}
} else if (hwp_active) {
/*
* Re-enable HWP in case this happens after a resume from ACPI
	 * S3 if the CPU was offline during the whole suspend/resume
* cycle.
*/
intel_pstate_hwp_reenable(cpu);
}
cpu->epp_powersave = -EINVAL;
cpu->epp_policy = 0;
intel_pstate_get_cpu_pstates(cpu);
pr_debug("controlling: cpu %d\n", cpunum);
return 0;
}
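
/*
 * Register the scheduler's utilization-update callback for this CPU so that
 * P-state selection can run from the scheduler context. With HWP active, the
 * hook is only installed when dynamic boost (hwp_boost) is in use.
 */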
static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
if (hwp_active && !hwp_boost)
return;
if (cpu->update_util_set)
return;
/* Prevent intel_pstate_update_util() from using stale data. */
cpu->sample.time = 0;
cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
(hwp_active ?
intel_pstate_update_util_hwp :
intel_pstate_update_util));
cpu->update_util_set = true;
}
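
/*
 * Remove the utilization-update callback for this CPU and wait for any
 * callbacks already in flight to complete before returning.
 */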
static void intel_pstate_clear_update_util_hook(unsigned int cpu)
{
struct cpudata *cpu_data = all_cpu_data[cpu];
if (!cpu_data->update_util_set)
return;
cpufreq_remove_update_util_hook(cpu);
cpu_data->update_util_set = false;
synchronize_rcu();
}
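
/*
 * Return the maximum frequency this CPU may run at: the turbo frequency when
 * turbo is available, or the maximum non-turbo frequency when turbo is
 * unavailable or disabled.
 */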
static int intel_pstate_get_max_freq(struct cpudata *cpu)
{
return global.turbo_disabled || global.no_turbo ?
cpu->pstate.max_freq : cpu->pstate.turbo_freq;
}
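
/*
 * Translate the frequency limits from the cpufreq policy into min/max
 * performance ratios for this CPU, combining them with the global
 * min_perf_pct/max_perf_pct limits unless per-CPU limits are in use.
 */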
static void intel_pstate_update_perf_limits(struct cpudata *cpu,
unsigned int policy_min,
unsigned int policy_max)
{
int perf_ctl_scaling = cpu->pstate.perf_ctl_scaling;
int32_t max_policy_perf, min_policy_perf;
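
	/*
	 * perf_ctl_scaling is the frequency (in kHz) corresponding to one
	 * unit of the PERF_CTL P-state ratio, so dividing a policy frequency
	 * by it yields the matching ratio.
	 */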
max_policy_perf = policy_max / perf_ctl_scaling;
if (policy_max == policy_min) {
min_policy_perf = max_policy_perf;
} else {
min_policy_perf = policy_min / perf_ctl_scaling;
min_policy_perf = clamp_t(int32_t, min_policy_perf,
0, max_policy_perf);
}
/*
* HWP needs some special consideration, because HWP_REQUEST uses
* abstract values to represent performance rather than pure ratios.
*/
if (hwp_active && cpu->pstate.scaling != perf_ctl_scaling) {
int freq;
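
		/*
		 * Convert each ratio back to a frequency in the PERF_CTL
		 * domain and translate that frequency into this CPU's HWP
		 * performance scale.
		 */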
freq = max_policy_perf * perf_ctl_scaling;
max_policy_perf = intel_pstate_freq_to_hwp(cpu, freq);
freq = min_policy_perf * perf_ctl_scaling;
min_policy_perf = intel_pstate_freq_to_hwp(cpu, freq);
}
pr_debug("cpu:%d min_policy_perf:%d max_policy_perf:%d\n",
cpu->cpu, min_policy_perf, max_policy_perf);
/* Normalize user input to [min_perf, max_perf] */
if (per_cpu_limits) {
cpu->min_perf_ratio = min_policy_perf;
cpu->max_perf_ratio = max_policy_perf;
} else {
int turbo_max = cpu->pstate.turbo_pstate;
int32_t global_min, global_max;
/* Global limits are in percent of the maximum turbo P-state. */
global_max = DIV_ROUND_UP(turbo_max * global.max_perf_pct, 100);
global_min = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
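		/*
		 * For example (assumed values): turbo_max = 40 and
		 * max_perf_pct = 80 give
		 * global_max = DIV_ROUND_UP(40 * 80, 100) = 32.
		 */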
global_min = clamp_t(int32_t, global_min, 0, global_max);
pr_debug("cpu:%d global_min:%d global_max:%d\n", cpu->cpu,
global_min, global_max);
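
		/*
		 * Take the more restrictive of the per-policy and global
		 * bounds on each end of the range.
		 */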
cpu->min_perf_ratio = max(min_policy_perf, global_min);
cpu->min_perf_ratio = min(cpu->min_perf_ratio, max_policy_perf);
cpu->max_perf_ratio = min(max_policy_perf, global_max);
cpu->max_perf_ratio = max(min_policy_perf, cpu->max_perf_ratio);
/* Make sure min_perf <= max_perf */
cpu->min_perf_ratio = min(cpu->min_perf_ratio,
cpu->max_perf_ratio);
}
pr_debug("cpu:%d max_perf_ratio:%d min_perf_ratio:%d\n", cpu->cpu,
cpu->max_perf_ratio,
cpu->min_perf_ratio);
}
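
/*
 * cpufreq ->setpolicy callback: apply the new policy (governor choice and
 * frequency limits) to the given CPU.
 */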
static int intel_pstate_set_policy(struct cpufreq_policy *policy)
{
struct cpudata *cpu;
if (!policy->cpuinfo.max_freq)
return -ENODEV;
pr_debug("set_policy cpuinfo.max %u policy->max %u\n",
policy->cpuinfo.max_freq, policy->max);
cpu = all_cpu_data[policy->cpu];
cpu->policy = policy->policy;
mutex_lock(&intel_pstate_limits_lock);
intel_pstate_update_perf_limits(cpu, policy->min, policy->max);
if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE) {
/*
* NOHZ_FULL CPUs need this as the governor callback may not
* be invoked on them.
*/
intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_max_within_limits(cpu);
} else {
intel_pstate_set_update_util_hook(policy->cpu);
}
if (hwp_active) {
		/*
		 * If hwp_boost was active before and has been turned
		 * off dynamically, the update util hook needs to be
		 * cleared.
		 */
if (!hwp_boost)
intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_hwp_set(policy->cpu);
}
mutex_unlock(&intel_pstate_limits_lock);
return 0;
}
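/*
 * Illustrative note (not from the original sources): in the active mode
 * the cpufreq core invokes the ->setpolicy() callback above when user
 * space selects a policy, e.g.:
 *
 *	# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
 *
 * which reaches intel_pstate_set_policy() with policy->policy set to
 * CPUFREQ_POLICY_PERFORMANCE.
 */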
static void intel_pstate_adjust_policy_max(struct cpudata *cpu,
struct cpufreq_policy_data *policy)
{
if (!hwp_active &&
cpu->pstate.max_pstate_physical > cpu->pstate.max_pstate &&
policy->max < policy->cpuinfo.max_freq &&
policy->max > cpu->pstate.max_freq) {
pr_debug("policy->max > max non turbo frequency\n");
policy->max = policy->cpuinfo.max_freq;
}
}
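/*
 * Worked example (hypothetical numbers): with config TDP active, assume
 * cpu->pstate.max_freq == 1100000 and policy->cpuinfo.max_freq == 2900000.
 * A requested policy->max strictly between the two already falls into the
 * turbo range, so intel_pstate_adjust_policy_max() widens it to
 * cpuinfo.max_freq rather than letting a rounded-down ceiling mask the
 * turbo P-states.
 */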
static void intel_pstate_verify_cpu_policy(struct cpudata *cpu,
struct cpufreq_policy_data *policy)
{
int max_freq;
update_turbo_state();
if (hwp_active) {
intel_pstate_get_hwp_cap(cpu);
max_freq = global.no_turbo || global.turbo_disabled ?
cpu->pstate.max_freq : cpu->pstate.turbo_freq;
} else {
max_freq = intel_pstate_get_max_freq(cpu);
}
cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq, max_freq);
intel_pstate_adjust_policy_max(cpu, policy);
}
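/*
 * Note (illustrative): cpufreq_verify_within_limits() clamps both
 * policy->min and policy->max into [cpuinfo.min_freq, max_freq]. With
 * HWP active, max_freq is derived from freshly re-read HWP capabilities
 * rather than cached values, so a guaranteed-performance change made by
 * an out-of-band agent is honored by subsequent limit updates.
 */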
static int intel_pstate_verify_policy(struct cpufreq_policy_data *policy)
{
intel_pstate_verify_cpu_policy(all_cpu_data[policy->cpu], policy);
return 0;
}
static int intel_cpufreq_cpu_offline(struct cpufreq_policy *policy)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
pr_debug("CPU %d going offline\n", cpu->cpu);
if (cpu->suspended)
return 0;
/*
* If the CPU is an SMT thread and it goes offline with the performance
* settings different from the minimum, it will prevent its sibling
* from getting to lower performance levels, so force the minimum
* performance on CPU offline to prevent that from happening.
*/
if (hwp_active)
intel_pstate_hwp_offline(cpu);
else
intel_pstate_set_min_pstate(cpu);
intel_pstate_exit_perf_limits(policy);
return 0;
}
static int intel_pstate_cpu_online(struct cpufreq_policy *policy)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
pr_debug("CPU %d going online\n", cpu->cpu);
intel_pstate_init_acpi_perf_limits(policy);
if (hwp_active) {
/*
* Re-enable HWP and clear the "suspended" flag to let "resume"
* know that it need not do that.
*/
intel_pstate_hwp_reenable(cpu);
cpu->suspended = false;
}
return 0;
}
static int intel_pstate_cpu_offline(struct cpufreq_policy *policy)
{
intel_pstate_clear_update_util_hook(policy->cpu);
return intel_cpufreq_cpu_offline(policy);
}
static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
{
pr_debug("CPU %d exiting\n", policy->cpu);
policy->fast_switch_possible = false;
return 0;
}
static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
{
struct cpudata *cpu;
int rc;
rc = intel_pstate_init_cpu(policy->cpu);
if (rc)
return rc;
cpu = all_cpu_data[policy->cpu];
	/* Start with the widest possible per-CPU performance window. */
	cpu->max_perf_ratio = 0xFF;
	cpu->min_perf_ratio = 0;
/* cpuinfo and default policy values */
policy->cpuinfo.min_freq = cpu->pstate.min_freq;
update_turbo_state();
global.turbo_disabled_mf = global.turbo_disabled;
policy->cpuinfo.max_freq = global.turbo_disabled ?
cpu->pstate.max_freq : cpu->pstate.turbo_freq;
policy->min = policy->cpuinfo.min_freq;
policy->max = policy->cpuinfo.max_freq;
intel_pstate_init_acpi_perf_limits(policy);
policy->fast_switch_possible = true;
return 0;
}
static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
{
int ret = __intel_pstate_cpu_init(policy);
if (ret)
return ret;
/*
* Set the policy to powersave to provide a valid fallback value in case
* the default cpufreq governor is neither powersave nor performance.
*/
policy->policy = CPUFREQ_POLICY_POWERSAVE;
if (hwp_active) {
struct cpudata *cpu = all_cpu_data[policy->cpu];
cpu->epp_cached = intel_pstate_get_epp(cpu, 0);
}
return 0;
}
static struct cpufreq_driver intel_pstate = {
.flags = CPUFREQ_CONST_LOOPS,
.verify = intel_pstate_verify_policy,
.setpolicy = intel_pstate_set_policy,
.suspend = intel_pstate_suspend,
.resume = intel_pstate_resume,
.init = intel_pstate_cpu_init,
.exit = intel_pstate_cpu_exit,
.offline = intel_pstate_cpu_offline,
.online = intel_pstate_cpu_online,
.update_limits = intel_pstate_update_limits,
.name = "intel_pstate",
};
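/*
 * Note (illustrative): since this cpufreq_driver provides ->setpolicy()
 * and no ->target() callbacks, the cpufreq core treats intel_pstate as a
 * setpolicy (active mode) driver and exposes only the built-in
 * "performance" and "powersave" policies for it instead of the generic
 * governors.
 */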
static int intel_cpufreq_verify_policy(struct cpufreq_policy_data *policy)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
intel_pstate_verify_cpu_policy(cpu, policy);
intel_pstate_update_perf_limits(cpu, policy->min, policy->max);
return 0;
}
/* Use of trace in passive mode:
*
* In passive mode the trace core_busy field (also known as the
 * performance field, and labelled as such on the graphs; also known as
* core_avg_perf) is not needed and so is re-assigned to indicate if the
* driver call was via the normal or fast switch path. Various graphs
* output from the intel_pstate_tracer.py utility that include core_busy
* (or performance or core_avg_perf) have a fixed y-axis from 0 to 100%,
* so we use 10 to indicate the normal path through the driver, and
* 90 to indicate the fast switch path through the driver.
* The scaled_busy field is not used, and is set to 0.
*/
#define INTEL_PSTATE_TRACE_TARGET 10
#define INTEL_PSTATE_TRACE_FAST_SWITCH 90
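/*
 * For example (illustrative): in intel_pstate_tracer.py output, a sample
 * whose core_busy column reads 90 was requested through the fast switch
 * path, while 10 marks the normal ->target() path.
 */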
static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, int old_pstate)
{
struct sample *sample;
if (!trace_pstate_sample_enabled())
return;
if (!intel_pstate_sample(cpu, ktime_get()))
return;
sample = &cpu->sample;
trace_pstate_sample(trace_type,
0,
old_pstate,
cpu->pstate.current_pstate,
sample->mperf,
sample->aperf,
sample->tsc,
get_avg_frequency(cpu),
fp_toint(cpu->iowait_boost * 100));
}
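/*
 * Sketch of the MSR_HWP_REQUEST encoding manipulated below (per the
 * Intel SDM): bits 7:0 hold the minimum performance level, bits 15:8 the
 * maximum and bits 23:16 the desired level, so for instance:
 *
 *	HWP_MIN_PERF(8) | HWP_MAX_PERF(32) | HWP_DESIRED_PERF(0) == 0x2008
 */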
static void intel_cpufreq_hwp_update(struct cpudata *cpu, u32 min, u32 max,
u32 desired, bool fast_switch)
{
u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
value &= ~HWP_MIN_PERF(~0L);
value |= HWP_MIN_PERF(min);
value &= ~HWP_MAX_PERF(~0L);
value |= HWP_MAX_PERF(max);
value &= ~HWP_DESIRED_PERF(~0L);
value |= HWP_DESIRED_PERF(desired);
if (value == prev)
return;
WRITE_ONCE(cpu->hwp_req_cached, value);
if (fast_switch)
wrmsrl(MSR_HWP_REQUEST, value);
else
wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
}
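/*
 * Note (assumption): the fast switch path runs on the target CPU in
 * scheduler context, which is why a local wrmsrl() suffices there, while
 * the normal path may run on any CPU and uses wrmsrl_on_cpu() to reach
 * the right one.
 */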
static void intel_cpufreq_perf_ctl_update(struct cpudata *cpu,
u32 target_pstate, bool fast_switch)
{
if (fast_switch)
wrmsrl(MSR_IA32_PERF_CTL,
pstate_funcs.get_val(cpu, target_pstate));
else
wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
pstate_funcs.get_val(cpu, target_pstate));
}
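/*
 * Note (illustrative): with HWP active the hardware picks the final
 * P-state, so unless the governor set policy->strict_target the request
 * made below only pins the floor at target_pstate and leaves the ceiling
 * at cpu->max_perf_ratio for the HWP algorithm to use.
 */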
static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy,
int target_pstate, bool fast_switch)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
int old_pstate = cpu->pstate.current_pstate;
target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
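	/*
	 * With HWP enabled, the target P-state is programmed as the HWP
	 * floor (minimum performance limit); the ceiling is the target
	 * itself for strict-target policies and the policy maximum
	 * otherwise.  Without HWP, PERF_CTL only needs to be written when
	 * the target P-state actually changes.
	 */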
if (hwp_active) {
int max_pstate = policy->strict_target ?
target_pstate : cpu->max_perf_ratio;
intel_cpufreq_hwp_update(cpu, target_pstate, max_pstate, 0,
fast_switch);
} else if (target_pstate != old_pstate) {
intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch);
}
cpu->pstate.current_pstate = target_pstate;
intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
INTEL_PSTATE_TRACE_TARGET, old_pstate);
return target_pstate;
}
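/*
 * ->target() callback used by governors that do not fast-switch: the
 * frequency change is bracketed by the cpufreq transition notifiers and
 * freqs.new is updated to reflect the P-state actually requested.
 */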
static int intel_cpufreq_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
struct cpufreq_freqs freqs;
int target_pstate;
update_turbo_state();
freqs.old = policy->cur;
freqs.new = target_freq;
cpufreq_freq_transition_begin(policy, &freqs);
target_pstate = intel_pstate_freq_to_hwp_rel(cpu, freqs.new, relation);
target_pstate = intel_cpufreq_update_pstate(policy, target_pstate, false);
freqs.new = target_pstate * cpu->pstate.scaling;
cpufreq_freq_transition_end(policy, &freqs, false);
return 0;
}
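/*
 * ->fast_switch() callback invoked from scheduler context: no transition
 * notifiers here; the frequency corresponding to the P-state actually
 * requested is returned to the governor.
 */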
static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy,
unsigned int target_freq)
{
struct cpudata *cpu = all_cpu_data[policy->cpu];
int target_pstate;
update_turbo_state();
target_pstate = intel_pstate_freq_to_hwp(cpu, target_freq);
target_pstate = intel_cpufreq_update_pstate(policy, target_pstate, true);
return target_pstate * cpu->pstate.scaling;
}
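/*
 * ->adjust_perf() callback: the schedutil governor supplies min_perf and
 * target_perf as fractions of 'capacity'.  They are rescaled against the
 * highest available P-state (or the guaranteed one when turbo is
 * disabled) and clamped to the policy limits before being written into
 * the HWP request, with the target used as the HWP desired value.
 */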
static void intel_cpufreq_adjust_perf(unsigned int cpunum,
unsigned long min_perf,
unsigned long target_perf,
unsigned long capacity)
{
struct cpudata *cpu = all_cpu_data[cpunum];
u64 hwp_cap = READ_ONCE(cpu->hwp_cap_cached);
int old_pstate = cpu->pstate.current_pstate;
int cap_pstate, min_pstate, max_pstate, target_pstate;
update_turbo_state();
cap_pstate = global.turbo_disabled ? HWP_GUARANTEED_PERF(hwp_cap) :
HWP_HIGHEST_PERF(hwp_cap);
/* Optimization: Avoid unnecessary divisions. */
target_pstate = cap_pstate;
if (target_perf < capacity)
target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
min_pstate = cap_pstate;
if (min_perf < capacity)
min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
if (min_pstate < cpu->pstate.min_pstate)
min_pstate = cpu->pstate.min_pstate;
if (min_pstate < cpu->min_perf_ratio)
min_pstate = cpu->min_perf_ratio;
if (min_pstate > cpu->max_perf_ratio)
min_pstate = cpu->max_perf_ratio;
max_pstate = min(cap_pstate, cpu->max_perf_ratio);
if (max_pstate < min_pstate)
max_pstate = min_pstate;
target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);
cpu->pstate.current_pstate = target_pstate;
intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
}
static int intel_cpufreq_cpu_init(struct cpufreq_policy *policy)
{
struct freq_qos_request *req;
struct cpudata *cpu;
struct device *dev;
int ret, freq;
dev = get_cpu_device(policy->cpu);
if (!dev)
return -ENODEV;
ret = __intel_pstate_cpu_init(policy);
if (ret)
return ret;
policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
/* This reflects the intel_pstate_get_cpu_pstates() setting. */
policy->cur = policy->cpuinfo.min_freq;
req = kcalloc(2, sizeof(*req), GFP_KERNEL);
if (!req) {
ret = -ENOMEM;
goto pstate_exit;
}
cpu = all_cpu_data[policy->cpu];
if (hwp_active) {
u64 value;
policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
intel_pstate_get_hwp_cap(cpu);
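		/*
		 * Cache the initial HWP request and the EPP value it
		 * carries; from this point on the driver takes full
		 * ownership of MSR_HWP_REQUEST.
		 */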
rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
WRITE_ONCE(cpu->hwp_req_cached, value);
cpu->epp_cached = intel_pstate_get_epp(cpu, value);
} else {
policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
}
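	/*
	 * Install a min/max frequency QoS request pair so that the global
	 * min_perf_pct and max_perf_pct limits are enforced on this policy;
	 * both requests are dropped in intel_cpufreq_cpu_exit().
	 */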
freq = DIV_ROUND_UP(cpu->pstate.turbo_freq * global.min_perf_pct, 100);
ret = freq_qos_add_request(&policy->constraints, req, FREQ_QOS_MIN,
freq);
if (ret < 0) {
dev_err(dev, "Failed to add min-freq constraint (%d)\n", ret);
goto free_req;
}
freq = DIV_ROUND_UP(cpu->pstate.turbo_freq * global.max_perf_pct, 100);
ret = freq_qos_add_request(&policy->constraints, req + 1, FREQ_QOS_MAX,
freq);
if (ret < 0) {
dev_err(dev, "Failed to add max-freq constraint (%d)\n", ret);
goto remove_min_req;
}
policy->driver_data = req;
return 0;
remove_min_req:
freq_qos_remove_request(req);
free_req:
kfree(req);
pstate_exit:
intel_pstate_exit_perf_limits(policy);
return ret;
}
static int intel_cpufreq_cpu_exit(struct cpufreq_policy *policy)
{
struct freq_qos_request *req;
req = policy->driver_data;
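	/* Drop the max (req + 1) and min (req) QoS requests added at init. */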
freq_qos_remove_request(req + 1);
freq_qos_remove_request(req);
kfree(req);
return intel_pstate_cpu_exit(policy);
}
static int intel_cpufreq_suspend(struct cpufreq_policy *policy)
{
intel_pstate_suspend(policy);
if (hwp_active) {
struct cpudata *cpu = all_cpu_data[policy->cpu];
u64 value = READ_ONCE(cpu->hwp_req_cached);
/*
* Clear the desired perf field in MSR_HWP_REQUEST in case
* intel_cpufreq_adjust_perf() is in use and the last value
* written by it may not be suitable.
*/
value &= ~HWP_DESIRED_PERF(~0L);
wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
WRITE_ONCE(cpu->hwp_req_cached, value);
}
return 0;
}
static struct cpufreq_driver intel_cpufreq = {
.flags = CPUFREQ_CONST_LOOPS,
.verify = intel_cpufreq_verify_policy,
.target = intel_cpufreq_target,
.fast_switch = intel_cpufreq_fast_switch,
.init = intel_cpufreq_cpu_init,
.exit = intel_cpufreq_cpu_exit,
.offline = intel_cpufreq_cpu_offline,
.online = intel_pstate_cpu_online,
.suspend = intel_cpufreq_suspend,
.resume = intel_pstate_resume,
.update_limits = intel_pstate_update_limits,
.name = "intel_cpufreq",
};
static struct cpufreq_driver *default_driver;
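/*
 * Free all per-CPU data; the utilization update hook is only installed in
 * active mode (&intel_pstate), so it is cleared before the data is freed.
 */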
static void intel_pstate_driver_cleanup(void)
{
unsigned int cpu;
cpus_read_lock();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
if (intel_pstate_driver == &intel_pstate)
intel_pstate_clear_update_util_hook(cpu);
kfree(all_cpu_data[cpu]);
all_cpu_data[cpu] = NULL;
}
}
cpus_read_unlock();
intel_pstate_driver = NULL;
}
static int intel_pstate_register_driver(struct cpufreq_driver *driver)
{
int ret;
if (driver == &intel_pstate)
intel_pstate_sysfs_expose_hwp_dynamic_boost();
memset(&global, 0, sizeof(global));
global.max_perf_pct = 100;
intel_pstate_driver = driver;
ret = cpufreq_register_driver(intel_pstate_driver);
if (ret) {
intel_pstate_driver_cleanup();
return ret;
}
global.min_perf_pct = min_perf_pct_min();
return 0;
}
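/*
 * The global sysfs "status" attribute accepts "off", "active" and
 * "passive".  Switching to "off" is rejected with -EBUSY when HWP is
 * active, since HWP cannot be disabled at run time.
 */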
static ssize_t intel_pstate_show_status(char *buf)
{
if (!intel_pstate_driver)
return sprintf(buf, "off\n");
return sprintf(buf, "%s\n", intel_pstate_driver == &intel_pstate ?
"active" : "passive");
}
static int intel_pstate_update_status(const char *buf, size_t size)
{
if (size == 3 && !strncmp(buf, "off", size)) {
if (!intel_pstate_driver)
return -EINVAL;
if (hwp_active)
return -EBUSY;
cpufreq_unregister_driver(intel_pstate_driver);
intel_pstate_driver_cleanup();
return 0;
}
if (size == 6 && !strncmp(buf, "active", size)) {
if (intel_pstate_driver) {
if (intel_pstate_driver == &intel_pstate)
return 0;
cpufreq_unregister_driver(intel_pstate_driver);
}
return intel_pstate_register_driver(&intel_pstate);
}
if (size == 7 && !strncmp(buf, "passive", size)) {
if (intel_pstate_driver) {
if (intel_pstate_driver == &intel_cpufreq)
return 0;
cpufreq_unregister_driver(intel_pstate_driver);
intel_pstate_sysfs_hide_hwp_dynamic_boost();
}
return intel_pstate_register_driver(&intel_cpufreq);
}
return -EINVAL;
}
static int no_load __initdata;
static int no_hwp __initdata;
static int hwp_only __initdata;
static unsigned int force_load __initdata;
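/* Sanity check: a zero max, min or turbo P-state means the MSRs are not usable. */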
static int __init intel_pstate_msrs_not_valid(void)
{
if (!pstate_funcs.get_max(0) ||
!pstate_funcs.get_min(0) ||
!pstate_funcs.get_turbo(0))
return -ENODEV;
return 0;
}
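/* Snapshot the CPU-model-specific callbacks selected at driver init time. */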
static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
{
pstate_funcs.get_max = funcs->get_max;
pstate_funcs.get_max_physical = funcs->get_max_physical;
pstate_funcs.get_min = funcs->get_min;
pstate_funcs.get_turbo = funcs->get_turbo;
pstate_funcs.get_scaling = funcs->get_scaling;
pstate_funcs.get_val = funcs->get_val;
pstate_funcs.get_vid = funcs->get_vid;
pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
}
#ifdef CONFIG_ACPI
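/*
 * Firmware coordination checks: if the platform firmware exposes ACPI
 * performance control objects (_PSS, PCCH or _PPC), another cpufreq
 * driver may be expected to manage P-states, in which case intel_pstate
 * may need to step aside.
 */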
static bool __init intel_pstate_no_acpi_pss(void)
{
int i;
for_each_possible_cpu(i) {
acpi_status status;
union acpi_object *pss;
struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
struct acpi_processor *pr = per_cpu(processors, i);
if (!pr)
continue;
status = acpi_evaluate_object(pr->handle, "_PSS", NULL, &buffer);
if (ACPI_FAILURE(status))
continue;
pss = buffer.pointer;
if (pss && pss->type == ACPI_TYPE_PACKAGE) {
kfree(pss);
return false;
}
kfree(pss);
}
pr_debug("ACPI _PSS not found\n");
return true;
}
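
/*
 * Return true if the ACPI \_SB scope has no PCCH method, i.e. the
 * platform does not provide the Processor Clocking Control interface
 * used by the pcc_cpufreq driver.
 */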
static bool __init intel_pstate_no_acpi_pcch(void)
{
acpi_status status;
acpi_handle handle;
status = acpi_get_handle(NULL, "\\_SB", &handle);
if (ACPI_FAILURE(status))
goto not_found;
if (acpi_has_method(handle, "PCCH"))
return false;
not_found:
pr_debug("ACPI PCCH not found\n");
return true;
}
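
/* Return true if any CPU's ACPI object implements the _PPC method. */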
static bool __init intel_pstate_has_acpi_ppc(void)
{
int i;
for_each_possible_cpu(i) {
struct acpi_processor *pr = per_cpu(processors, i);
if (!pr)
continue;
if (acpi_has_method(pr->handle, "_PPC"))
return true;
}
pr_debug("ACPI _PPC not found\n");
return false;
}
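
/* Tags selecting which ACPI check applies to a plat_info[] entry below. */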
enum {
PSS,
PPC,
};
/* Hardware vendor platforms that have their own power management modes */
static struct acpi_platform_list plat_info[] __initdata = {
{"HP ", "ProLiant", 0, ACPI_SIG_FADT, all_versions, NULL, PSS},
{"ORACLE", "X4-2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4-2L ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4-2B ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X3-2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X3-2L ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X3-2B ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4470M2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4270M3 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4270M2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4170M2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4170 M3", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X4275 M3", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "X6-2 ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{"ORACLE", "Sudbury ", 0, ACPI_SIG_FADT, all_versions, NULL, PPC},
{ } /* End */
};
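
/*
 * Bits 8 and 18 of MSR_MISC_PWR_MGMT: either one set means P-states
 * are managed out of band by the firmware/hardware.
 */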
#define BITMASK_OOB (BIT(8) | BIT(18))
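
/*
 * Return true if P-state control belongs to the platform: either
 * out-of-band control is flagged in MSR_MISC_PWR_MGMT, or the system
 * matches a plat_info[] entry and the corresponding ACPI check (PSS
 * or PPC) indicates firmware-managed power management.
 */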
static bool __init intel_pstate_platform_pwr_mgmt_exists(void)
{
const struct x86_cpu_id *id;
u64 misc_pwr;
int idx;
id = x86_match_cpu(intel_pstate_cpu_oob_ids);
if (id) {
rdmsrl(MSR_MISC_PWR_MGMT, misc_pwr);
if (misc_pwr & BITMASK_OOB) {
pr_debug("Bit 8 or 18 in the MISC_PWR_MGMT MSR set\n");
pr_debug("P states are controlled in Out of Band mode by the firmware/hardware\n");
return true;
}
}
idx = acpi_match_platform_list(plat_info);
if (idx < 0)
return false;
switch (plat_info[idx].data) {
case PSS:
if (!intel_pstate_no_acpi_pss())
return false;
return intel_pstate_no_acpi_pcch();
case PPC:
return intel_pstate_has_acpi_ppc() && !force_load;
}
return false;
}
static void intel_pstate_request_control_from_smm(void)
{
/*
* It may be unsafe to request P-states control from SMM if _PPC support
* has not been enabled.
*/
if (acpi_ppc)
acpi_processor_pstate_control();
}
#else /* CONFIG_ACPI not enabled */
static inline bool intel_pstate_platform_pwr_mgmt_exists(void) { return false; }
static inline bool intel_pstate_has_acpi_ppc(void) { return false; }
static inline void intel_pstate_request_control_from_smm(void) {}
#endif /* CONFIG_ACPI */
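
/*
 * Broadwell HWP does not report ratio-based capabilities, so such
 * CPUs are flagged here and the legacy one-core max turbo ratio is
 * used for them instead (see the hwp_mode_bdw handling in
 * intel_pstate_init() below).
 */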
#define INTEL_PSTATE_HWP_BROADWELL 0x01
#define X86_MATCH_HWP(model, hwp_mode) \
X86_MATCH_VENDOR_FAM_MODEL_FEATURE(INTEL, 6, INTEL_FAM6_##model, \
X86_FEATURE_HWP, hwp_mode)
static const struct x86_cpu_id hwp_support_ids[] __initconst = {
X86_MATCH_HWP(BROADWELL_X, INTEL_PSTATE_HWP_BROADWELL),
X86_MATCH_HWP(BROADWELL_D, INTEL_PSTATE_HWP_BROADWELL),
X86_MATCH_HWP(ANY, 0),
{}
};
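
/*
 * Return true if bit 0 of MSR_PM_ENABLE is set, i.e. HWP has already
 * been enabled (typically by the firmware).
 */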
static bool intel_pstate_hwp_is_enabled(void)
{
u64 value;
rdmsrl(MSR_PM_ENABLE, value);
return !!(value & 0x1);
}
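
/*
 * Driver entry point: check for Intel hardware, pick HWP or legacy
 * pstate_funcs, bail out if the platform firmware owns P-state
 * control, and register the default cpufreq driver (active
 * intel_pstate when HWP is in use, passive intel_cpufreq otherwise).
 */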
static int __init intel_pstate_init(void)
{
const struct x86_cpu_id *id;
int rc;
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return -ENODEV;
id = x86_match_cpu(hwp_support_ids);
if (id) {
bool hwp_forced = intel_pstate_hwp_is_enabled();
if (hwp_forced)
pr_info("HWP enabled by BIOS\n");
else if (no_load)
return -ENODEV;
copy_cpu_funcs(&core_funcs);
/*
* Avoid enabling HWP for processors without EPP support,
* because that means incomplete HWP implementation which is a
* corner case and supporting it is generally problematic.
*
* If HWP is enabled already, though, there is no choice but to
* deal with it.
*/
if ((!no_hwp && boot_cpu_has(X86_FEATURE_HWP_EPP)) || hwp_forced) {
hwp_active++;
hwp_mode_bdw = id->driver_data;
intel_pstate.attr = hwp_cpufreq_attrs;
intel_cpufreq.attr = hwp_cpufreq_attrs;
intel_cpufreq.flags |= CPUFREQ_NEED_UPDATE_LIMITS;
intel_cpufreq.adjust_perf = intel_cpufreq_adjust_perf;
if (!default_driver)
default_driver = &intel_pstate;
if (boot_cpu_has(X86_FEATURE_HYBRID_CPU))
pstate_funcs.get_cpu_scaling = hybrid_get_cpu_scaling;
goto hwp_cpu_matched;
}
pr_info("HWP not enabled\n");
} else {
if (no_load)
return -ENODEV;
id = x86_match_cpu(intel_pstate_cpu_ids);
if (!id) {
pr_info("CPU model not supported\n");
return -ENODEV;
}
copy_cpu_funcs((struct pstate_funcs *)id->driver_data);
}
if (intel_pstate_msrs_not_valid()) {
pr_info("Invalid MSRs\n");
return -ENODEV;
}
/* Without HWP, start in the passive mode. */
if (!default_driver)
default_driver = &intel_cpufreq;
hwp_cpu_matched:
/*
* The Intel pstate driver will be ignored if the platform
* firmware has its own power management modes.
*/
if (intel_pstate_platform_pwr_mgmt_exists()) {
pr_info("P-states controlled by the platform\n");
return -ENODEV;
}
if (!hwp_active && hwp_only)
return -ENOTSUPP;
pr_info("Intel P-state driver initializing\n");
all_cpu_data = vzalloc(array_size(sizeof(void *), num_possible_cpus()));
if (!all_cpu_data)
return -ENOMEM;
intel_pstate_request_control_from_smm();
intel_pstate_sysfs_expose_params();
mutex_lock(&intel_pstate_driver_lock);
rc = intel_pstate_register_driver(default_driver);
mutex_unlock(&intel_pstate_driver_lock);
if (rc) {
intel_pstate_sysfs_remove();
return rc;
}
if (hwp_active) {
const struct x86_cpu_id *id;
id = x86_match_cpu(intel_pstate_cpu_ee_disable_ids);
if (id) {
set_power_ctl_ee_state(false);
pr_info("Disabling energy efficiency optimization\n");
}
pr_info("HWP enabled\n");
} else if (boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
pr_warn("Problematic setup: Hybrid processor with disabled HWP\n");
}
return 0;
}
device_initcall(intel_pstate_init);
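
/*
 * Parse the intel_pstate= command-line options handled below:
 * "disable", "active", "passive", "no_hwp", "force", "hwp_only",
 * "per_cpu_perf_limits" and, with CONFIG_ACPI, "support_acpi_ppc".
 */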
static int __init intel_pstate_setup(char *str)
{
if (!str)
return -EINVAL;
if (!strcmp(str, "disable"))
no_load = 1;
else if (!strcmp(str, "active"))
default_driver = &intel_pstate;
else if (!strcmp(str, "passive"))
default_driver = &intel_cpufreq;
if (!strcmp(str, "no_hwp"))
no_hwp = 1;
if (!strcmp(str, "force"))
force_load = 1;
if (!strcmp(str, "hwp_only"))
hwp_only = 1;
if (!strcmp(str, "per_cpu_perf_limits"))
per_cpu_limits = true;
#ifdef CONFIG_ACPI
if (!strcmp(str, "support_acpi_ppc"))
acpi_ppc = true;
#endif
return 0;
}
early_param("intel_pstate", intel_pstate_setup);
MODULE_AUTHOR("Dirk Brandewie <dirk.j.brandewie@intel.com>");
MODULE_DESCRIPTION("'intel_pstate' - P state driver for Intel Core processors");
MODULE_LICENSE("GPL");