From 33aa46f252c703e42c81a76696cd0c240f2281e4 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 25 Mar 2020 15:03:35 +0100 Subject: [PATCH 01/51] cpufreq: intel_pstate: Use passive mode by default without HWP After recent changes allowing scale-invariant utilization to be used on x86, the schedutil governor on top of intel_pstate in the passive mode should be on par with (or better than) the active mode "powersave" algorithm of intel_pstate on systems in which hardware-managed P-states (HWP) are not used, so it should not be necessary to use the internal scaling algorithm in those cases. Accordingly, modify intel_pstate to start in the passive mode by default if the processor at hand does not support HWP or if the driver is requested to avoid using HWP through the kernel command line. Among other things, that will allow utilization clamps and the support for RT/DL tasks in the schedutil governor to be utilized on systems in which intel_pstate is used. Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/intel_pstate.rst | 32 +++++++++++-------- drivers/cpufreq/intel_pstate.c | 3 +- 2 files changed, 21 insertions(+), 14 deletions(-) diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index ad392f3aee06..39d80bc29ccd 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -62,9 +62,10 @@ on the capabilities of the processor. Active Mode ----------- -This is the default operation mode of ``intel_pstate``. If it works in this -mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` -policies contains the string "intel_pstate". +This is the default operation mode of ``intel_pstate`` for processors with +hardware-managed P-states (HWP) support. If it works in this mode, the +``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies +contains the string "intel_pstate". In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and provides its own scaling algorithms for P-state selection. Those algorithms @@ -138,12 +139,13 @@ internal P-state selection logic to be less performance-focused. Active Mode Without HWP ~~~~~~~~~~~~~~~~~~~~~~~ -This is the default operation mode for processors that do not support the HWP -feature. It also is used by default with the ``intel_pstate=no_hwp`` argument -in the kernel command line. However, in this mode ``intel_pstate`` may refuse -to work with the given processor if it does not recognize it. [Note that -``intel_pstate`` will never refuse to work with any processor with the HWP -feature enabled.] +This operation mode is optional for processors that do not support the HWP +feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel in +the command line. The active mode is used in those cases if the +``intel_pstate=active`` argument is passed to the kernel in the command line. +In this mode ``intel_pstate`` may refuse to work with processors that are not +recognized by it. [Note that ``intel_pstate`` will never refuse to work with +any processor with the HWP feature enabled.] In this mode ``intel_pstate`` registers utilization update callbacks with the CPU scheduler in order to run a P-state selection algorithm, either @@ -188,10 +190,14 @@ is not set. Passive Mode ------------ -This mode is used if the ``intel_pstate=passive`` argument is passed to the -kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too).
-Like in the active mode without HWP support, in this mode ``intel_pstate`` may -refuse to work with the given processor if it does not recognize it. +This is the default operation mode of ``intel_pstate`` for processors without +hardware-managed P-states (HWP) support. It is always used if the +``intel_pstate=passive`` argument is passed to the kernel in the command line +regardless of whether or not the given processor supports HWP. [Note that the +``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used +without ``intel_pstate=active``.] Like in the active mode without HWP support, +in this mode ``intel_pstate`` may refuse to work with processors that are not +recognized by it. If the driver works in this mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq". diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 4d1e25d1ced1..66ab6523c3eb 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -2771,6 +2771,8 @@ static int __init intel_pstate_init(void) pr_info("Invalid MSRs\n"); return -ENODEV; } + /* Without HWP start in the passive mode. */ + default_driver = &intel_cpufreq; hwp_cpu_matched: /* @@ -2816,7 +2818,6 @@ static int __init intel_pstate_setup(char *str) if (!strcmp(str, "disable")) { no_load = 1; } else if (!strcmp(str, "passive")) { - pr_info("Passive mode enabled\n"); default_driver = &intel_cpufreq; no_hwp = 1; } From 107d47b2b95ef478d71f3bf36201886d7475427a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:29:30 +0200 Subject: [PATCH 02/51] PM: sleep: core: Simplify the SMART_SUSPEND flag handling The code to handle the SMART_SUSPEND driver PM flag is hard to follow and somewhat inconsistent with respect to devices without middle-layer (subsystem) callbacks. Namely, for those devices the core takes the role of a middle layer in providing the expected ordering of execution of callbacks (under the assumption that the drivers setting SMART_SUSPEND can reuse their PM-runtime callbacks directly for system-wide suspend). To that end, it prevents driver ->suspend_late and ->suspend_noirq callbacks from being executed for devices that are still runtime-suspended in __device_suspend_late(), because running the same callback function that was previously run by PM-runtime for them may be invalid. However, it does that only for devices without any middle-layer callbacks for the late/noirq/early suspend/resume phases even though it would be simpler and more consistent to skip the driver-level callbacks for all devices with SMART_SUSPEND set that are runtime-suspended in __device_suspend_late(). Simplify the code in accordance with the above observation. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Alan Stern --- drivers/base/power/main.c | 118 +++++++++++++------------------------- 1 file changed, 39 insertions(+), 79 deletions(-) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index fdd508a78ffd..5d0225573bbe 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -561,24 +561,6 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) /*------------------------- Resume routines -------------------------*/ -/** - * suspend_event - Return a "suspend" message for given "resume" one. - * @resume_msg: PM message representing a system-wide resume transition.
- */ -static pm_message_t suspend_event(pm_message_t resume_msg) -{ - switch (resume_msg.event) { - case PM_EVENT_RESUME: - return PMSG_SUSPEND; - case PM_EVENT_THAW: - case PM_EVENT_RESTORE: - return PMSG_FREEZE; - case PM_EVENT_RECOVER: - return PMSG_HIBERNATE; - } - return PMSG_ON; -} - /** * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. @@ -656,37 +638,36 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; - skip_resume = dev_pm_may_skip_resume(dev); - callback = dpm_subsys_resume_noirq_cb(dev, state, &info); - if (callback) + if (callback) { + skip_resume = false; goto Run; + } + skip_resume = dev_pm_may_skip_resume(dev); if (skip_resume) goto Skip; - if (dev_pm_smart_suspend_and_suspended(dev)) { - pm_message_t suspend_msg = suspend_event(state); - - /* - * If "freeze" callbacks have been skipped during a transition - * related to hibernation, the subsequent "thaw" callbacks must - * be skipped too or bad things may happen. Otherwise, resume - * callbacks are going to be run for the device, so its runtime - * PM status must be changed to reflect the new state after the - * transition under way. - */ - if (!dpm_subsys_suspend_late_cb(dev, suspend_msg, NULL) && - !dpm_subsys_suspend_noirq_cb(dev, suspend_msg, NULL)) { - if (state.event == PM_EVENT_THAW) { - skip_resume = true; - goto Skip; - } else { - pm_runtime_set_active(dev); - } - } + /* + * If "freeze" driver callbacks have been skipped during hibernation, + * because the device was runtime-suspended in __device_suspend_late(), + * the corresponding "thaw" callbacks must be skipped too, because + * running them for a runtime-suspended device may not be valid. + */ + if (dev_pm_smart_suspend_and_suspended(dev) && + state.event == PM_EVENT_THAW) { + skip_resume = true; + goto Skip; } + /* + * The device is going to be resumed, so set its PM-runtime status to + * "active", but do that only if DPM_FLAG_SMART_SUSPEND is set to avoid + * confusing drivers that don't use it. + */ + if (dev_pm_smart_suspend_and_suspended(dev)) + pm_runtime_set_active(dev); + if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); @@ -1274,32 +1255,6 @@ static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, return callback; } -static bool device_must_resume(struct device *dev, pm_message_t state, - bool no_subsys_suspend_noirq) -{ - pm_message_t resume_msg = resume_event(state); - - /* - * If all of the device driver's "noirq", "late" and "early" callbacks - * are invoked directly by the core, the decision to allow the device to - * stay in suspend can be based on its current runtime PM status and its - * wakeup settings. - */ - if (no_subsys_suspend_noirq && - !dpm_subsys_suspend_late_cb(dev, state, NULL) && - !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && - !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) - return !pm_runtime_status_suspended(dev) && - (resume_msg.event != PM_EVENT_RESUME || - (device_can_wakeup(dev) && !device_may_wakeup(dev))); - - /* - * The only safe strategy here is to require that if the device may not - * be left in suspend, resume callbacks must be invoked for it. - */ - return !dev->power.may_skip_resume; -} - /** * __device_suspend_noirq - Execute a "noirq suspend" callback for given device. * @dev: Device to handle. 
@@ -1313,7 +1268,6 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a { pm_callback_t callback; const char *info; - bool no_subsys_cb = false; int error = 0; TRACE_DEVICE(dev); @@ -1331,9 +1285,7 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a if (callback) goto Run; - no_subsys_cb = !dpm_subsys_suspend_late_cb(dev, state, NULL); - - if (dev_pm_smart_suspend_and_suspended(dev) && no_subsys_cb) + if (dev_pm_smart_suspend_and_suspended(dev)) goto Skip; if (dev->driver && dev->driver->pm) { @@ -1351,13 +1303,16 @@ Run: Skip: dev->power.is_noirq_suspended = true; - if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { - dev->power.must_resume = dev->power.must_resume || - atomic_read(&dev->power.usage_count) > 1 || - device_must_resume(dev, state, no_subsys_cb); - } else { + /* + * Skipping the resume of devices that were in use right before the + * system suspend (as indicated by their PM-runtime usage counters) + * would be suboptimal. Also resume them if doing that is not allowed + * to be skipped. + */ + if (atomic_read(&dev->power.usage_count) > 1 || + !(dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED) && + dev->power.may_skip_resume)) dev->power.must_resume = true; - } if (dev->power.must_resume) dpm_superior_set_must_resume(dev); @@ -1539,9 +1494,14 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (callback) goto Run; - if (dev_pm_smart_suspend_and_suspended(dev) && - !dpm_subsys_suspend_noirq_cb(dev, state, NULL)) + if (dev_pm_smart_suspend_and_suspended(dev)) { + /* + * In principle, the resume of the device may be skippend if it + * remains in runtime suspend at this point. + */ + dev->power.may_skip_resume = true; goto Skip; + } if (dev->driver && dev->driver->pm) { info = "late driver "; From 30205377ddbb717ee451e872fd59511f4f76373d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:51:44 +0200 Subject: [PATCH 03/51] PM: sleep: core: Fold functions into their callers Fold four functions in the PM core that each have only one caller now into their callers. No intentional functional impact. Signed-off-by: Rafael J. 
Wysocki Acked-by: Alan Stern --- drivers/base/power/main.c | 198 ++++++++++++-------------------------- 1 file changed, 60 insertions(+), 138 deletions(-) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 5d0225573bbe..75d7cdb4de9c 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -573,43 +573,6 @@ bool dev_pm_may_skip_resume(struct device *dev) return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; } -static pm_callback_t dpm_subsys_resume_noirq_cb(struct device *dev, - pm_message_t state, - const char **info_p) -{ - pm_callback_t callback; - const char *info; - - if (dev->pm_domain) { - info = "noirq power domain "; - callback = pm_noirq_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "noirq type "; - callback = pm_noirq_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "noirq class "; - callback = pm_noirq_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "noirq bus "; - callback = pm_noirq_op(dev->bus->pm, state); - } else { - return NULL; - } - - if (info_p) - *info_p = info; - - return callback; -} - -static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, - pm_message_t state, - const char **info_p); - -static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, - pm_message_t state, - const char **info_p); - /** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. @@ -621,8 +584,8 @@ static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, */ static int device_resume_noirq(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback; - const char *info; + pm_callback_t callback = NULL; + const char *info = NULL; bool skip_resume; int error = 0; @@ -638,7 +601,19 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; - callback = dpm_subsys_resume_noirq_cb(dev, state, &info); + if (dev->pm_domain) { + info = "noirq power domain "; + callback = pm_noirq_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "noirq type "; + callback = pm_noirq_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "noirq class "; + callback = pm_noirq_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "noirq bus "; + callback = pm_noirq_op(dev->bus->pm, state); + } if (callback) { skip_resume = false; goto Run; @@ -791,35 +766,6 @@ void dpm_resume_noirq(pm_message_t state) cpuidle_resume(); } -static pm_callback_t dpm_subsys_resume_early_cb(struct device *dev, - pm_message_t state, - const char **info_p) -{ - pm_callback_t callback; - const char *info; - - if (dev->pm_domain) { - info = "early power domain "; - callback = pm_late_early_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "early type "; - callback = pm_late_early_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "early class "; - callback = pm_late_early_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "early bus "; - callback = pm_late_early_op(dev->bus->pm, state); - } else { - return NULL; - } - - if (info_p) - *info_p = info; - - return callback; -} - /** * device_resume_early - Execute an "early resume" callback for given device. * @dev: Device to handle. 
@@ -830,8 +776,8 @@ static pm_callback_t dpm_subsys_resume_early_cb(struct device *dev, */ static int device_resume_early(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback; - const char *info; + pm_callback_t callback = NULL; + const char *info = NULL; int error = 0; TRACE_DEVICE(dev); @@ -846,9 +792,19 @@ static int device_resume_early(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; - callback = dpm_subsys_resume_early_cb(dev, state, &info); - - if (!callback && dev->driver && dev->driver->pm) { + if (dev->pm_domain) { + info = "early power domain "; + callback = pm_late_early_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "early type "; + callback = pm_late_early_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "early class "; + callback = pm_late_early_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "early bus "; + callback = pm_late_early_op(dev->bus->pm, state); + } else if (dev->driver && dev->driver->pm) { info = "early driver "; callback = pm_late_early_op(dev->driver->pm, state); } @@ -1226,35 +1182,6 @@ static void dpm_superior_set_must_resume(struct device *dev) device_links_read_unlock(idx); } -static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, - pm_message_t state, - const char **info_p) -{ - pm_callback_t callback; - const char *info; - - if (dev->pm_domain) { - info = "noirq power domain "; - callback = pm_noirq_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "noirq type "; - callback = pm_noirq_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "noirq class "; - callback = pm_noirq_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "noirq bus "; - callback = pm_noirq_op(dev->bus->pm, state); - } else { - return NULL; - } - - if (info_p) - *info_p = info; - - return callback; -} - /** * __device_suspend_noirq - Execute a "noirq suspend" callback for given device. * @dev: Device to handle. 
@@ -1266,8 +1193,8 @@ static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, */ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback; - const char *info; + pm_callback_t callback = NULL; + const char *info = NULL; int error = 0; TRACE_DEVICE(dev); @@ -1281,7 +1208,19 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a if (dev->power.syscore || dev->power.direct_complete) goto Complete; - callback = dpm_subsys_suspend_noirq_cb(dev, state, &info); + if (dev->pm_domain) { + info = "noirq power domain "; + callback = pm_noirq_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "noirq type "; + callback = pm_noirq_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "noirq class "; + callback = pm_noirq_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "noirq bus "; + callback = pm_noirq_op(dev->bus->pm, state); + } if (callback) goto Run; @@ -1429,35 +1368,6 @@ static void dpm_propagate_wakeup_to_parent(struct device *dev) spin_unlock_irq(&parent->power.lock); } -static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, - pm_message_t state, - const char **info_p) -{ - pm_callback_t callback; - const char *info; - - if (dev->pm_domain) { - info = "late power domain "; - callback = pm_late_early_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "late type "; - callback = pm_late_early_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "late class "; - callback = pm_late_early_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "late bus "; - callback = pm_late_early_op(dev->bus->pm, state); - } else { - return NULL; - } - - if (info_p) - *info_p = info; - - return callback; -} - /** * __device_suspend_late - Execute a "late suspend" callback for given device. * @dev: Device to handle. @@ -1468,8 +1378,8 @@ static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, */ static int __device_suspend_late(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback; - const char *info; + pm_callback_t callback = NULL; + const char *info = NULL; int error = 0; TRACE_DEVICE(dev); @@ -1490,7 +1400,19 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (dev->power.syscore || dev->power.direct_complete) goto Complete; - callback = dpm_subsys_suspend_late_cb(dev, state, &info); + if (dev->pm_domain) { + info = "late power domain "; + callback = pm_late_early_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "late type "; + callback = pm_late_early_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "late class "; + callback = pm_late_early_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "late bus "; + callback = pm_late_early_op(dev->bus->pm, state); + } if (callback) goto Run; From 6e176bf8d46194353163c2cb660808bc633b45d9 Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Sat, 18 Apr 2020 18:52:08 +0200 Subject: [PATCH 04/51] PM: sleep: core: Do not skip callbacks in the resume phase The current code in device_resume_noirq() causes the entire early resume and resume phases of device suspend to be skipped for devices for which the noirq resume phase have been skipped (due to the LEAVE_SUSPENDED flag being set) on the premise that those devices should stay in runtime-suspend after system-wide resume. However, that may not be correct in two situations. First, the middle layer (subsystem) noirq resume callback may be missing for a given device, but its early resume callback may be present and it may need to do something even if it decides to skip the driver callback. Second, if the device's wakeup settings were adjusted in the suspend phase without resuming the device (that was in runtime suspend at that time), they most likely need to be adjusted again in the resume phase and so the driver callback in that phase needs to be run. For the above reason, modify the core to allow the middle layer ->resume_late callback to run even if its ->resume_noirq callback is missing (and the core has skipped the driver-level callback in that phase) and to allow all device callbacks to run in the resume phase. Also make the core set the PM-runtime status of devices with SMART_SUSPEND set whose resume callbacks are not skipped to "active" in the "noirq" resume phase and update the affected subsystems (PCI and ACPI) accordingly. After this change, middle-layer (subsystem) callbacks will always be invoked in all phases of system suspend and resume and driver callbacks will always run in the prepare, suspend, resume, and complete phases for all devices. For devices with SMART_SUSPEND set, driver callbacks will be skipped in the late and noirq phases of system suspend if those devices remain in runtime suspend in __device_suspend_late(). Driver callbacks will also be skipped for them during the noirq and early phases of the "thaw" transition related to hibernation in that case. Setting LEAVE_SUSPENDED means that the driver allows its callbacks to be skipped in the noirq and early phases of system resume, but some additional conditions need to be met for that to happen (among other things, the power.may_skip_resume flag needs to be set for the device during system suspend for the driver callbacks to be skipped during the subsequent resume transition). For all devices with SMART_SUSPEND set whose driver callbacks are invoked during system resume, the PM-runtime status will be set to "active" (by the core). Signed-off-by: Rafael J. 
Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/power/pci.rst | 5 +-- drivers/acpi/acpi_lpss.c | 6 +-- drivers/acpi/device_pm.c | 15 +++---- drivers/base/power/main.c | 85 ++++++++++++++++++------------------- drivers/pci/pci-driver.c | 18 ++++---- 5 files changed, 62 insertions(+), 67 deletions(-) diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index 0924d29636ad..a39b2461919a 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1035,10 +1035,7 @@ This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during the "noirq" phase of system-wide suspend and analogous transitions) and next it uses the dev_pm_may_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() early, as the PM core will skip the remaining resume -callbacks for the device during the transition under way and will set its -runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for -it. +pci_pm_resume_noirq() and pci_pm_resume_early() upfront. 3.2. Device Runtime Power Management ------------------------------------ diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index dee999938213..c4a84df6cc98 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -1093,6 +1093,9 @@ static int acpi_lpss_resume_early(struct device *dev) if (pdata->dev_desc->resume_from_noirq) return 0; + if (dev_pm_may_skip_resume(dev)) + return 0; + return acpi_lpss_do_resume_early(dev); } @@ -1105,9 +1108,6 @@ static int acpi_lpss_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - ret = pm_generic_resume_noirq(dev); if (ret) return ret; diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index b2263ec67b43..399684085f85 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1132,14 +1132,6 @@ static int acpi_subsys_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - /* - * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend - * during system suspend, so update their runtime PM status to "active" - * as they will be put into D0 going forward. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - return pm_generic_resume_noirq(dev); } @@ -1153,7 +1145,12 @@ static int acpi_subsys_resume_noirq(struct device *dev) */ static int acpi_subsys_resume_early(struct device *dev) { - int ret = acpi_dev_resume(dev); + int ret; + + if (dev_pm_may_skip_resume(dev)) + return 0; + + ret = acpi_dev_resume(dev); return ret ? ret : pm_generic_resume_early(dev); } diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 75d7cdb4de9c..25b0302188d8 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -565,12 +565,22 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. * - * Checks whether or not the device may be left in suspend after a system-wide - * transition to the working state. + * Return: + * - %false if the transition under way is RESTORE. + * - The return value of dev_pm_smart_suspend_and_suspended() if the transition + * under way is THAW. + * - The logical negation of %power.must_resume otherwise (that is, when the + * transition under way is RESUME). 
*/ bool dev_pm_may_skip_resume(struct device *dev) { - return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; + if (pm_transition.event == PM_EVENT_RESTORE) + return false; + + if (pm_transition.event == PM_EVENT_THAW) + return dev_pm_smart_suspend_and_suspended(dev); + + return !dev->power.must_resume; } /** @@ -601,6 +611,22 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; + skip_resume = dev_pm_may_skip_resume(dev); + /* + * If the driver callback is skipped below or by the middle layer + * callback and device_resume_early() also skips the driver callback for + * this device later, it needs to appear as "suspended" to PM-runtime, + * so change its status accordingly. + * + * Otherwise, the device is going to be resumed, so set its PM-runtime + * status to "active", but do that only if DPM_FLAG_SMART_SUSPEND is set + * to avoid confusing drivers that don't use it. + */ + if (skip_resume) + pm_runtime_set_suspended(dev); + else if (dev_pm_smart_suspend_and_suspended(dev)) + pm_runtime_set_active(dev); + if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(&dev->pm_domain->ops, state); @@ -614,35 +640,12 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn info = "noirq bus "; callback = pm_noirq_op(dev->bus->pm, state); } - if (callback) { - skip_resume = false; + if (callback) goto Run; - } - skip_resume = dev_pm_may_skip_resume(dev); if (skip_resume) goto Skip; - /* - * If "freeze" driver callbacks have been skipped during hibernation, - * because the device was runtime-suspended in __device_suspend_late(), - * the corresponding "thaw" callbacks must be skipped too, because - * running them for a runtime-suspended device may not be valid. - */ - if (dev_pm_smart_suspend_and_suspended(dev) && - state.event == PM_EVENT_THAW) { - skip_resume = true; - goto Skip; - } - - /* - * The device is going to be resumed, so set its PM-runtime status to - * "active", but do that only if DPM_FLAG_SMART_SUSPEND is set to avoid - * confusing drivers that don't use it. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); @@ -654,20 +657,6 @@ Run: Skip: dev->power.is_noirq_suspended = false; - if (skip_resume) { - /* Make the next phases of resume skip the device. */ - dev->power.is_late_suspended = false; - dev->power.is_suspended = false; - /* - * The device is going to be left in suspend, but it might not - * have been in runtime suspend before the system suspended, so - * its runtime PM status needs to be updated to avoid confusing - * the runtime PM framework when runtime PM is enabled for the - * device again. 
- */ - pm_runtime_set_suspended(dev); - } - Out: complete_all(&dev->power.completion); TRACE_RESUME(error); @@ -804,15 +793,25 @@ static int device_resume_early(struct device *dev, pm_message_t state, bool asyn } else if (dev->bus && dev->bus->pm) { info = "early bus "; callback = pm_late_early_op(dev->bus->pm, state); - } else if (dev->driver && dev->driver->pm) { + } + if (callback) + goto Run; + + if (dev_pm_may_skip_resume(dev)) + goto Skip; + + if (dev->driver && dev->driver->pm) { info = "early driver "; callback = pm_late_early_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); + +Skip: dev->power.is_late_suspended = false; - Out: +Out: TRACE_RESUME(error); pm_runtime_enable(dev); diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 0454ca0e4e3f..685fbf044911 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -896,14 +896,6 @@ static int pci_pm_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - /* - * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend - * during system suspend, so update their runtime PM status to "active" - * as they are going to be put into D0 shortly. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - /* * In the suspend-to-idle case, devices left in D0 during suspend will * stay in D0, so it is not necessary to restore or update their @@ -928,6 +920,14 @@ static int pci_pm_resume_noirq(struct device *dev) return 0; } +static int pci_pm_resume_early(struct device *dev) +{ + if (dev_pm_may_skip_resume(dev)) + return 0; + + return pm_generic_resume_early(dev); +} + static int pci_pm_resume(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); @@ -961,6 +961,7 @@ static int pci_pm_resume(struct device *dev) #define pci_pm_suspend_late NULL #define pci_pm_suspend_noirq NULL #define pci_pm_resume NULL +#define pci_pm_resume_early NULL #define pci_pm_resume_noirq NULL #endif /* !CONFIG_SUSPEND */ @@ -1358,6 +1359,7 @@ static const struct dev_pm_ops pci_dev_pm_ops = { .suspend = pci_pm_suspend, .suspend_late = pci_pm_suspend_late, .resume = pci_pm_resume, + .resume_early = pci_pm_resume_early, .freeze = pci_pm_freeze, .thaw = pci_pm_thaw, .poweroff = pci_pm_poweroff, From 0fe8a1be599ab97f840ba22d98cb8f24a9f9e872 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:52:19 +0200 Subject: [PATCH 05/51] PM: sleep: core: Rework the power.may_skip_resume handling Because the power.may_skip_resume device status bit is taken into account in combination with the DPM_FLAG_LEAVE_SUSPENDED driver flag, it can be set to 'true' for all devices in the "suspend" phase of a suspend-resume cycle, so do that. Then, neither the PM core nor the middle-layer (subsystem) code handling it needs to set it to 'true' any more and it just has to be cleared if there is a reason to avoid skipping the "noirq" and "early" resume callbacks provided by the driver, so update the code in question accordingly. Suggested-by: Alan Stern Signed-off-by: Rafael J.
Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- drivers/acpi/device_pm.c | 8 +++----- drivers/base/power/main.c | 10 ++-------- drivers/pci/pci-driver.c | 8 +++----- 3 files changed, 8 insertions(+), 18 deletions(-) diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 399684085f85..1b02d7dc7d34 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1100,10 +1100,8 @@ int acpi_subsys_suspend_noirq(struct device *dev) { int ret; - if (dev_pm_smart_suspend_and_suspended(dev)) { - dev->power.may_skip_resume = true; + if (dev_pm_smart_suspend_and_suspended(dev)) return 0; - } ret = pm_generic_suspend_noirq(dev); if (ret) @@ -1116,8 +1114,8 @@ int acpi_subsys_suspend_noirq(struct device *dev) * acpi_subsys_complete() to take care of fixing up the device's state * anyway, if need be. */ - dev->power.may_skip_resume = device_may_wakeup(dev) || - !device_can_wakeup(dev); + if (device_can_wakeup(dev) && !device_may_wakeup(dev)) + dev->power.may_skip_resume = false; return 0; } diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 25b0302188d8..5adf0be6aa47 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1415,14 +1415,8 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (callback) goto Run; - if (dev_pm_smart_suspend_and_suspended(dev)) { - /* - * In principle, the resume of the device may be skippend if it - * remains in runtime suspend at this point. - */ - dev->power.may_skip_resume = true; + if (dev_pm_smart_suspend_and_suspended(dev)) goto Skip; - } if (dev->driver && dev->driver->pm) { info = "late driver "; @@ -1647,7 +1641,7 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) dev->power.direct_complete = false; } - dev->power.may_skip_resume = false; + dev->power.may_skip_resume = true; dev->power.must_resume = false; dpm_watchdog_set(&wd, dev); diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 685fbf044911..ce220b1987df 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -789,10 +789,8 @@ static int pci_pm_suspend_noirq(struct device *dev) struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) { - dev->power.may_skip_resume = true; + if (dev_pm_smart_suspend_and_suspended(dev)) return 0; - } if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_SUSPEND); @@ -880,8 +878,8 @@ Fixup: * pci_pm_complete() to take care of fixing up the device's state * anyway, if need be. */ - dev->power.may_skip_resume = device_may_wakeup(dev) || - !device_can_wakeup(dev); + if (device_can_wakeup(dev) && !device_may_wakeup(dev)) + dev->power.may_skip_resume = false; return 0; } From 76c70cb58ce30264af4b714109ee756da25d830a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:52:30 +0200 Subject: [PATCH 06/51] PM: sleep: core: Rename dev_pm_may_skip_resume() The name of dev_pm_may_skip_resume() may be easily confused with the power.may_skip_resume flag which is not checked by that function, so rename the former as dev_pm_skip_resume(). No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. 
Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/power/pci.rst | 2 +- drivers/acpi/acpi_lpss.c | 4 ++-- drivers/acpi/device_pm.c | 4 ++-- drivers/base/power/main.c | 8 ++++---- drivers/pci/pci-driver.c | 4 ++-- include/linux/pm.h | 2 +- 6 files changed, 12 insertions(+), 12 deletions(-) diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index a39b2461919a..aa1c7fce6cd0 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1034,7 +1034,7 @@ device to be left in suspend after system-wide transitions to the working state. This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +uses the dev_pm_skip_resume() helper to decide whether or not to return from pci_pm_resume_noirq() and pci_pm_resume_early() upfront. 3.2. Device Runtime Power Management diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index c4a84df6cc98..7632df1a5be3 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -1093,7 +1093,7 @@ static int acpi_lpss_resume_early(struct device *dev) if (pdata->dev_desc->resume_from_noirq) return 0; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return acpi_lpss_do_resume_early(dev); @@ -1105,7 +1105,7 @@ static int acpi_lpss_resume_noirq(struct device *dev) int ret; /* Follow acpi_subsys_resume_noirq(). */ - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; ret = pm_generic_resume_noirq(dev); diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 1b02d7dc7d34..8c2a091728a9 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1127,7 +1127,7 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_noirq); */ static int acpi_subsys_resume_noirq(struct device *dev) { - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return pm_generic_resume_noirq(dev); @@ -1145,7 +1145,7 @@ static int acpi_subsys_resume_early(struct device *dev) { int ret; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; ret = acpi_dev_resume(dev); diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 5adf0be6aa47..f98eced0f200 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -562,7 +562,7 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) /*------------------------- Resume routines -------------------------*/ /** - * dev_pm_may_skip_resume - System-wide device resume optimization check. + * dev_pm_skip_resume - System-wide device resume optimization check. * @dev: Target device. * * Return: @@ -572,7 +572,7 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) * - The logical negation of %power.must_resume otherwise (that is, when the * transition under way is RESUME). 
*/ -bool dev_pm_may_skip_resume(struct device *dev) +bool dev_pm_skip_resume(struct device *dev) { if (pm_transition.event == PM_EVENT_RESTORE) return false; @@ -611,7 +611,7 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; - skip_resume = dev_pm_may_skip_resume(dev); + skip_resume = dev_pm_skip_resume(dev); /* * If the driver callback is skipped below or by the middle layer * callback and device_resume_early() also skips the driver callback for @@ -797,7 +797,7 @@ static int device_resume_early(struct device *dev, pm_message_t state, bool asyn if (callback) goto Run; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) goto Skip; if (dev->driver && dev->driver->pm) { diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index ce220b1987df..decf82595340 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -891,7 +891,7 @@ static int pci_pm_resume_noirq(struct device *dev) pci_power_t prev_state = pci_dev->current_state; bool skip_bus_pm = pci_dev->skip_bus_pm; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; /* @@ -920,7 +920,7 @@ static int pci_pm_resume_noirq(struct device *dev) static int pci_pm_resume_early(struct device *dev) { - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return pm_generic_resume_early(dev); diff --git a/include/linux/pm.h b/include/linux/pm.h index e057d1fa2469..d89b7099f241 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -758,7 +758,7 @@ extern int pm_generic_poweroff_late(struct device *dev); extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); -extern bool dev_pm_may_skip_resume(struct device *dev); +extern bool dev_pm_skip_resume(struct device *dev); extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); #else /* !CONFIG_PM_SLEEP */ From fa2bfead910322e44e7e0bb74364ac198a2abd32 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:52:48 +0200 Subject: [PATCH 07/51] PM: sleep: core: Rename dev_pm_smart_suspend_and_suspended() Because all callers of dev_pm_smart_suspend_and_suspended use it only for checking whether or not to skip driver suspend callbacks for a device, rename it to dev_pm_skip_suspend() in analogy with dev_pm_skip_resume(). No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. 
Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- drivers/acpi/acpi_lpss.c | 6 +++--- drivers/acpi/device_pm.c | 8 ++++---- drivers/base/power/main.c | 13 ++++++------- drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-driver.c | 8 ++++---- include/linux/pm.h | 2 +- 6 files changed, 19 insertions(+), 20 deletions(-) diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index 7632df1a5be3..5e2bfbcf526f 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -1041,7 +1041,7 @@ static int acpi_lpss_do_suspend_late(struct device *dev) { int ret; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; ret = pm_generic_suspend_late(dev); @@ -1169,7 +1169,7 @@ static int acpi_lpss_poweroff_late(struct device *dev) { struct lpss_private_data *pdata = acpi_driver_data(ACPI_COMPANION(dev)); - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; if (pdata->dev_desc->resume_from_noirq) @@ -1182,7 +1182,7 @@ static int acpi_lpss_poweroff_noirq(struct device *dev) { struct lpss_private_data *pdata = acpi_driver_data(ACPI_COMPANION(dev)); - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; if (pdata->dev_desc->resume_from_noirq) { diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 8c2a091728a9..ae234d731d42 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1084,7 +1084,7 @@ int acpi_subsys_suspend_late(struct device *dev) { int ret; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; ret = pm_generic_suspend_late(dev); @@ -1100,7 +1100,7 @@ int acpi_subsys_suspend_noirq(struct device *dev) { int ret; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; ret = pm_generic_suspend_noirq(dev); @@ -1213,7 +1213,7 @@ static int acpi_subsys_poweroff_late(struct device *dev) { int ret; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; ret = pm_generic_poweroff_late(dev); @@ -1229,7 +1229,7 @@ static int acpi_subsys_poweroff_late(struct device *dev) */ static int acpi_subsys_poweroff_noirq(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; return pm_generic_poweroff_noirq(dev); diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index f98eced0f200..3170d93e29f9 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -567,8 +567,7 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) * * Return: * - %false if the transition under way is RESTORE. - * - The return value of dev_pm_smart_suspend_and_suspended() if the transition - * under way is THAW. + * - Return value of dev_pm_skip_suspend() if the transition under way is THAW. * - The logical negation of %power.must_resume otherwise (that is, when the * transition under way is RESUME). 
*/ @@ -578,7 +577,7 @@ bool dev_pm_skip_resume(struct device *dev) return false; if (pm_transition.event == PM_EVENT_THAW) - return dev_pm_smart_suspend_and_suspended(dev); + return dev_pm_skip_suspend(dev); return !dev->power.must_resume; } @@ -624,7 +623,7 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn */ if (skip_resume) pm_runtime_set_suspended(dev); - else if (dev_pm_smart_suspend_and_suspended(dev)) + else if (dev_pm_skip_suspend(dev)) pm_runtime_set_active(dev); if (dev->pm_domain) { @@ -1223,7 +1222,7 @@ static int __device_suspend_noirq(struct device *dev, pm_message_t state, bool a if (callback) goto Run; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) goto Skip; if (dev->driver && dev->driver->pm) { @@ -1415,7 +1414,7 @@ static int __device_suspend_late(struct device *dev, pm_message_t state, bool as if (callback) goto Run; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) goto Skip; if (dev->driver && dev->driver->pm) { @@ -2003,7 +2002,7 @@ void device_pm_check_callbacks(struct device *dev) spin_unlock_irq(&dev->power.lock); } -bool dev_pm_smart_suspend_and_suspended(struct device *dev) +bool dev_pm_skip_suspend(struct device *dev) { return dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) && pm_runtime_status_suspended(dev); diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c index 312cc45c44c7..bf779f291f15 100644 --- a/drivers/pci/hotplug/pciehp_core.c +++ b/drivers/pci/hotplug/pciehp_core.c @@ -275,7 +275,7 @@ static int pciehp_suspend(struct pcie_device *dev) * If the port is already runtime suspended we can keep it that * way. */ - if (dev_pm_smart_suspend_and_suspended(&dev->port->dev)) + if (dev_pm_skip_suspend(&dev->port->dev)) return 0; pciehp_disable_interrupt(dev); diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index decf82595340..da6510af1221 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -776,7 +776,7 @@ static int pci_pm_suspend(struct device *dev) static int pci_pm_suspend_late(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; pci_fixup_device(pci_fixup_suspend, to_pci_dev(dev)); @@ -789,7 +789,7 @@ static int pci_pm_suspend_noirq(struct device *dev) struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; if (pci_has_legacy_pm_support(pci_dev)) @@ -1126,7 +1126,7 @@ static int pci_pm_poweroff(struct device *dev) static int pci_pm_poweroff_late(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; pci_fixup_device(pci_fixup_suspend, to_pci_dev(dev)); @@ -1139,7 +1139,7 @@ static int pci_pm_poweroff_noirq(struct device *dev) struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? 
dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_skip_suspend(dev)) return 0; if (pci_has_legacy_pm_support(pci_dev)) diff --git a/include/linux/pm.h b/include/linux/pm.h index d89b7099f241..8c59a7f0bcf4 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -759,7 +759,7 @@ extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); extern bool dev_pm_skip_resume(struct device *dev); -extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); +extern bool dev_pm_skip_suspend(struct device *dev); #else /* !CONFIG_PM_SLEEP */ From e07515563d010d8b32967634e8dc2fdc732c1aa6 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:53:01 +0200 Subject: [PATCH 08/51] PM: sleep: core: Rename DPM_FLAG_NEVER_SKIP Rename DPM_FLAG_NEVER_SKIP to DPM_FLAG_NO_DIRECT_COMPLETE which matches its purpose more closely. No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Bjorn Helgaas # for PCI parts Acked-by: Jeff Kirsher Acked-by: Alan Stern Acked-by: Bjorn Helgaas Acked-by: Alex Deucher --- Documentation/driver-api/pm/devices.rst | 6 +++--- Documentation/power/pci.rst | 10 +++++----- drivers/base/power/main.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +- drivers/gpu/drm/i915/intel_runtime_pm.c | 2 +- drivers/gpu/drm/radeon/radeon_kms.c | 2 +- drivers/misc/mei/pci-me.c | 2 +- drivers/misc/mei/pci-txe.c | 2 +- drivers/net/ethernet/intel/e1000e/netdev.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/pci/pcie/portdrv_pci.c | 2 +- include/linux/pm.h | 6 +++--- 13 files changed, 21 insertions(+), 21 deletions(-) diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index f66c7b9126ea..4ace0eba4506 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -361,9 +361,9 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. runtime PM disabled. This feature also can be controlled by device drivers by using the - ``DPM_FLAG_NEVER_SKIP`` and ``DPM_FLAG_SMART_PREPARE`` driver power - management flags. [Typically, they are set at the time the driver is - probed against the device in question by passing them to the + ``DPM_FLAG_NO_DIRECT_COMPLETE`` and ``DPM_FLAG_SMART_PREPARE`` driver + power management flags. [Typically, they are set at the time the driver + is probed against the device in question by passing them to the :c:func:`dev_pm_set_driver_flags` helper function.] If the first of these flags is set, the PM core will not apply the direct-complete procedure described above to the given device and, consequenty, to any diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index aa1c7fce6cd0..9e1408121bea 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1004,11 +1004,11 @@ including the PCI bus type. The flags should be set once at the driver probe time with the help of the dev_pm_set_driver_flags() function and they should not be updated directly afterwards. -The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete -mechanism allowing device suspend/resume callbacks to be skipped if the device -is in runtime suspend when the system suspend starts. That also affects all of -the ancestors of the device, so this flag should only be used if absolutely -necessary. 
+The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the +direct-complete mechanism allowing device suspend/resume callbacks to be skipped +if the device is in runtime suspend when the system suspend starts. That also +affects all of the ancestors of the device, so this flag should only be used if +absolutely necessary. The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a positive value from pci_pm_prepare() if the ->prepare callback provided by the diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 3170d93e29f9..dbc1e5e7346b 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1844,7 +1844,7 @@ unlock: spin_lock_irq(&dev->power.lock); dev->power.direct_complete = state.event == PM_EVENT_SUSPEND && (ret > 0 || dev->power.no_pm_callbacks) && - !dev_pm_test_driver_flags(dev, DPM_FLAG_NEVER_SKIP); + !dev_pm_test_driver_flags(dev, DPM_FLAG_NO_DIRECT_COMPLETE); spin_unlock_irq(&dev->power.lock); return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index fd1dc3236eca..a9086ea1ab60 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -191,7 +191,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags) } if (adev->runpm) { - dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_use_autosuspend(dev->dev); pm_runtime_set_autosuspend_delay(dev->dev, 5000); pm_runtime_set_active(dev->dev); diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c index ad719c9602af..9cb2d7548daa 100644 --- a/drivers/gpu/drm/i915/intel_runtime_pm.c +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c @@ -549,7 +549,7 @@ void intel_runtime_pm_enable(struct intel_runtime_pm *rpm) * becaue the HDA driver may require us to enable the audio power * domain during system suspend. */ - dev_pm_set_driver_flags(kdev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(kdev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_set_autosuspend_delay(kdev, 10000); /* 10s */ pm_runtime_mark_last_busy(kdev); diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c index 58176db85952..372962358a18 100644 --- a/drivers/gpu/drm/radeon/radeon_kms.c +++ b/drivers/gpu/drm/radeon/radeon_kms.c @@ -158,7 +158,7 @@ int radeon_driver_load_kms(struct drm_device *dev, unsigned long flags) } if (radeon_is_px(dev)) { - dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_use_autosuspend(dev->dev); pm_runtime_set_autosuspend_delay(dev->dev, 5000); pm_runtime_set_active(dev->dev); diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c index 3d21c38e2dbb..53f16f3bd091 100644 --- a/drivers/misc/mei/pci-me.c +++ b/drivers/misc/mei/pci-me.c @@ -240,7 +240,7 @@ static int mei_me_probe(struct pci_dev *pdev, const struct pci_device_id *ent) * MEI requires to resume from runtime suspend mode * in order to perform link reset flow upon system suspend. 
*/ - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); /* * ME maps runtime suspend/resume to D0i states, diff --git a/drivers/misc/mei/pci-txe.c b/drivers/misc/mei/pci-txe.c index beacf2a2f2b5..4bf26ce61044 100644 --- a/drivers/misc/mei/pci-txe.c +++ b/drivers/misc/mei/pci-txe.c @@ -128,7 +128,7 @@ static int mei_txe_probe(struct pci_dev *pdev, const struct pci_device_id *ent) * MEI requires to resume from runtime suspend mode * in order to perform link reset flow upon system suspend. */ - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); /* * TXE maps runtime suspend/resume to own power gating states, diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 177c6da80c57..2730b1c7dddb 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -7549,7 +7549,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent) e1000_print_device_info(adapter); - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); if (pci_dev_run_wake(pdev) && hw->mac.type < e1000_pch_cnp) pm_runtime_put_noidle(&pdev->dev); diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index b46bff8fe056..8bb3db2cbd41 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -3445,7 +3445,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) } } - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_put_noidle(&pdev->dev); return 0; diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c index 69fa1ce1f927..59fc0097438f 100644 --- a/drivers/net/ethernet/intel/igc/igc_main.c +++ b/drivers/net/ethernet/intel/igc/igc_main.c @@ -4825,7 +4825,7 @@ static int igc_probe(struct pci_dev *pdev, pcie_print_link_status(pdev); netdev_info(netdev, "MAC: %pM\n", netdev->dev_addr); - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_put_noidle(&pdev->dev); diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 160d67c59310..3acf151ae015 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -115,7 +115,7 @@ static int pcie_portdrv_probe(struct pci_dev *dev, pci_save_state(dev); - dev_pm_set_driver_flags(&dev->dev, DPM_FLAG_NEVER_SKIP | + dev_pm_set_driver_flags(&dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE | DPM_FLAG_SMART_SUSPEND); if (pci_bridge_d3_possible(dev)) { diff --git a/include/linux/pm.h b/include/linux/pm.h index 8c59a7f0bcf4..cdb8fbd6ab18 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -544,7 +544,7 @@ struct pm_subsys_data { * These flags can be set by device drivers at the probe time. They need not be * cleared by the drivers as the driver core will take care of that. * - * NEVER_SKIP: Do not skip all system suspend/resume callbacks for the device. + * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. 
* LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. @@ -554,7 +554,7 @@ struct pm_subsys_data { * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in * other words, the system suspend/resume callbacks can only be skipped for the * device if its driver doesn't object against that). This flag has no effect - * if NEVER_SKIP is set. + * if NO_DIRECT_COMPLETE is set. * * Setting SMART_SUSPEND instructs bus types and PM domains which may want to * runtime resume the device upfront during system suspend that doing so is not @@ -565,7 +565,7 @@ struct pm_subsys_data { * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the * driver prefers the device to be left in suspend after system resume. */ -#define DPM_FLAG_NEVER_SKIP BIT(0) +#define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) #define DPM_FLAG_SMART_SUSPEND BIT(2) #define DPM_FLAG_LEAVE_SUSPENDED BIT(3) From 2a3f34750b8b07df42ab4b30b70e029d46e0d7f3 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:53:20 +0200 Subject: [PATCH 09/51] PM: sleep: core: Rename DPM_FLAG_LEAVE_SUSPENDED Rename DPM_FLAG_LEAVE_SUSPENDED to DPM_FLAG_MAY_SKIP_RESUME which matches its purpose more closely. No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Wolfram Sang # for I2C Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/driver-api/pm/devices.rst | 4 ++-- Documentation/power/pci.rst | 2 +- drivers/acpi/acpi_tad.c | 2 +- drivers/base/power/main.c | 2 +- drivers/i2c/busses/i2c-designware-platdrv.c | 4 ++-- include/linux/pm.h | 6 +++--- 6 files changed, 10 insertions(+), 10 deletions(-) diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 4ace0eba4506..f342c7549b4c 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -803,7 +803,7 @@ general.] However, it often is desirable to leave devices in suspend after system transitions to the working state, especially if those devices had been in runtime suspend before the preceding system-wide suspend (or analogous) -transition. Device drivers can use the ``DPM_FLAG_LEAVE_SUSPENDED`` flag to +transition. Device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to indicate to the PM core (and middle-layer code) that they prefer the specific devices handled by them to be left suspended and they have no problems with skipping their system-wide resume callbacks for this reason. Whether or not the @@ -825,7 +825,7 @@ device really can be left in suspend. 
For devices whose "noirq", "late" and "early" driver callbacks are invoked directly by the PM core, all of the system-wide resume callbacks are skipped if -``DPM_FLAG_LEAVE_SUSPENDED`` is set and the device is in runtime suspend during +``DPM_FLAG_MAY_SKIP_RESUME`` is set and the device is in runtime suspend during the ``suspend_noirq`` (or analogous) phase or the transition under way is a proper system suspend (rather than anything related to hibernation) and the device's wakeup settings are suitable for runtime PM (that is, it cannot diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index 9e1408121bea..f09b382b4621 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1029,7 +1029,7 @@ into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. -Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver prefers the device to be left in suspend after system-wide transitions to the working state. This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during diff --git a/drivers/acpi/acpi_tad.c b/drivers/acpi/acpi_tad.c index 33a4bcdaa4d7..7d45cce0c3c1 100644 --- a/drivers/acpi/acpi_tad.c +++ b/drivers/acpi/acpi_tad.c @@ -624,7 +624,7 @@ static int acpi_tad_probe(struct platform_device *pdev) */ device_init_wakeup(dev, true); dev_pm_set_driver_flags(dev, DPM_FLAG_SMART_SUSPEND | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); /* * The platform bus type layer tells the ACPI PM domain powers up the * device, so set the runtime PM status of it to "active". diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index dbc1e5e7346b..aaa4aaf41d27 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1247,7 +1247,7 @@ Skip: * to be skipped. */ if (atomic_read(&dev->power.usage_count) > 1 || - !(dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED) && + !(dev_pm_test_driver_flags(dev, DPM_FLAG_MAY_SKIP_RESUME) && dev->power.may_skip_resume)) dev->power.must_resume = true; diff --git a/drivers/i2c/busses/i2c-designware-platdrv.c b/drivers/i2c/busses/i2c-designware-platdrv.c index 5536673060cc..c429d664f655 100644 --- a/drivers/i2c/busses/i2c-designware-platdrv.c +++ b/drivers/i2c/busses/i2c-designware-platdrv.c @@ -357,12 +357,12 @@ static int dw_i2c_plat_probe(struct platform_device *pdev) if (dev->flags & ACCESS_NO_IRQ_SUSPEND) { dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_SMART_PREPARE | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); } else { dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_SMART_PREPARE | DPM_FLAG_SMART_SUSPEND | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); } /* The code below assumes runtime PM to be disabled. */ diff --git a/include/linux/pm.h b/include/linux/pm.h index cdb8fbd6ab18..35796fc49e7a 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -547,7 +547,7 @@ struct pm_subsys_data { * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. - * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. 
+ * MAY_SKIP_RESUME: Avoid resuming the device during system resume if possible. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -562,13 +562,13 @@ struct pm_subsys_data { * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by * the driver if they decide to leave the device in runtime suspend. * - * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the + * Setting MAY_SKIP_RESUME informs the PM core and middle-layer code that the * driver prefers the device to be left in suspend after system resume. */ #define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) #define DPM_FLAG_SMART_SUSPEND BIT(2) -#define DPM_FLAG_LEAVE_SUSPENDED BIT(3) +#define DPM_FLAG_MAY_SKIP_RESUME BIT(3) struct dev_pm_info { pm_message_t power_state; From 2fff3f73e8c27801b84d2315e1a49bce96b00eff Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:55:32 +0200 Subject: [PATCH 10/51] Documentation: PM: sleep: Update driver flags documentation Update the documentation of the driver flags for system-wide power management to reflect the current code flows and be more consistent. Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst | 135 ++++++++++++++++-------- Documentation/power/pci.rst | 41 +++---- include/linux/pm.h | 22 +--- 3 files changed, 115 insertions(+), 83 deletions(-) diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index f342c7549b4c..782cb37073a3 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -772,62 +772,107 @@ the state of devices (possibly except for resuming them from runtime suspend) from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). +.. _smart_suspend_flag: + +The ``DPM_FLAG_SMART_SUSPEND`` Driver Flag +------------------------------------------ + Some bus types and PM domains have a policy to resume all devices from runtime suspend upfront in their ``->suspend`` callbacks, but that may not be really necessary if the driver of the device can cope with runtime-suspended devices. The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in -:c:member:`power.driver_flags` at the probe time, by passing it to the -:c:func:`dev_pm_set_driver_flags` helper. That also may cause middle-layer code +:c:member:`power.driver_flags` at the probe time with the help of the +:c:func:`dev_pm_set_driver_flags` helper routine. + +However, setting that flag also causes the PM core and middle-layer code (bus types, PM domains etc.) to skip the ``->suspend_late`` and ``->suspend_noirq`` callbacks provided by the driver if the device remains in -runtime suspend at the beginning of the ``suspend_late`` phase of system-wide -suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM -has been disabled for it, under the assumption that its state should not change -after that point until the system-wide transition is over (the PM core itself -does that for devices whose "noirq", "late" and "early" system-wide PM callbacks -are executed directly by it). 
If that happens, the driver's system-wide resume -callbacks, if present, may still be invoked during the subsequent system-wide -resume transition and the device's runtime power management status may be set -to "active" before enabling runtime PM for it, so the driver must be prepared to -cope with the invocation of its system-wide resume callbacks back-to-back with -its ``->runtime_suspend`` one (without the intervening ``->runtime_resume`` and -so on) and the final state of the device must reflect the "active" runtime PM -status in that case. +runtime suspend during the ``suspend_late`` phase of system-wide suspend (or +during the ``poweroff_late`` or ``freeze_late`` phase of hibernation), +after runtime PM was disabled for it. [Without doing that, the same driver +callback might be executed twice in a row for the same device, which would not +be valid in general.] If the middle-layer system-wide PM callbacks are present +for the device, they are responsible for doing the above, and the PM core takes +care of it otherwise. + +In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_late`` +and ``->thaw_noirq`` callbacks are skipped if the device remained in runtime +suspend during the preceding "freeze" transition related to hibernation. +Again, if the middle-layer callbacks are present for the device, they are +responsible for doing that, or the PM core takes care of it otherwise. + + +The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag +-------------------------------------------- During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`. [Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in -general.] - -However, it often is desirable to leave devices in suspend after system -transitions to the working state, especially if those devices had been in +general.] However, it often is desirable to leave devices in suspend after +system transitions to the working state, especially if those devices had been in runtime suspend before the preceding system-wide suspend (or analogous) -transition. Device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to -indicate to the PM core (and middle-layer code) that they prefer the specific -devices handled by them to be left suspended and they have no problems with -skipping their system-wide resume callbacks for this reason. Whether or not the -devices will actually be left in suspend may depend on their state before the -given system suspend-resume cycle and on the type of the system transition under -way. In particular, devices are not left suspended if that transition is a -restore from hibernation, as device states are not guaranteed to be reflected -by the information stored in the hibernation image in that case. +transition. -The middle-layer code involved in the handling of the device is expected to -indicate to the PM core if the device may be left in suspend by setting its -:c:member:`power.may_skip_resume` status bit which is checked by the PM core -during the "noirq" phase of the preceding system-wide suspend (or analogous) -transition. 
The middle layer is then responsible for handling the device as -appropriate in its "noirq" resume callback, which is executed regardless of -whether or not the device is left suspended, but the other resume callbacks -(except for ``->complete``) will be skipped automatically by the PM core if the -device really can be left in suspend. +To that end, device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to +indicate to the PM core and middle-layer code that they allow their "noirq" and +"early" resume callbacks to be skipped if the device can be left in suspend +after system-wide PM transitions to the working state. Whether or not that is +the case generally depends on the state of the device before the given system +suspend-resume cycle and on the type of the system transition under way. +In particular, the "restore" and "thaw" transitions related to hibernation are +not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All devices are always +resumed during the "restore" transition and whether or not any driver callbacks +are skipped during the "freeze" transition depends whether or not the +``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_).] -For devices whose "noirq", "late" and "early" driver callbacks are invoked -directly by the PM core, all of the system-wide resume callbacks are skipped if -``DPM_FLAG_MAY_SKIP_RESUME`` is set and the device is in runtime suspend during -the ``suspend_noirq`` (or analogous) phase or the transition under way is a -proper system suspend (rather than anything related to hibernation) and the -device's wakeup settings are suitable for runtime PM (that is, it cannot -generate wakeup signals at all or it is allowed to wake up the system from -sleep). +The ``DPM_FLAG_MAY_SKIP_RESUME`` flag is taken into account in combination with +the :c:member:`power.may_skip_resume` status bit set by the PM core during the +"suspend" phase of suspend-type transitions. If the driver or the middle layer +has a reason to prevent the driver's "noirq" and "early" resume callbacks from +being skipped during the subsequent resume transition of the system, it should +clear :c:member:`power.may_skip_resume` in its ``->suspend``, ``->suspend_late`` +or ``->suspend_noirq`` callback. [Note that the drivers setting +``DPM_FLAG_SMART_SUSPEND`` need to clear :c:member:`power.may_skip_resume` in +their ``->suspend`` callback in case the other two are skipped.] + +Setting the :c:member:`power.may_skip_resume` status bit along with the +``DPM_FLAG_MAY_SKIP_RESUME`` flag is necessary, but generally not sufficient, +for the driver's "noirq" and "early" resume callbacks to be skipped. Whether or +not they should be skipped can be determined by evaluating the +:c:func:`dev_pm_skip_resume` helper function. + +If that function returns ``true``, the driver's "noirq" and "early" resume +callbacks should be skipped and the device's runtime PM status will be set to +"suspended" by the PM core. Otherwise, if the device was runtime-suspended +during the preceding system-wide suspend transition and +``DPM_FLAG_SMART_SUSPEND`` is set for it, its runtime PM status will be set to +"active" by the PM core. [Hence, the drivers that do not set +``DPM_FLAG_SMART_SUSPEND`` should not expect the runtime PM status of their +devices to be changed from "suspended" to "active" by the PM core during +system-wide resume-type transitions.] 
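As an illustration of the opt-in and of the dev_pm_skip_resume() check described above, here is a minimal sketch under those semantics (not code from this series; the foo_* names are hypothetical):

#include <linux/pm.h>
#include <linux/pm_runtime.h>

/* Hypothetical driver opting in to both optimizations at probe time. */
static int foo_probe(struct device *dev)
{
	dev_pm_set_driver_flags(dev, DPM_FLAG_SMART_SUSPEND |
				     DPM_FLAG_MAY_SKIP_RESUME);
	return 0;
}

/*
 * Hypothetical "noirq" resume path (middle layer, or a driver whose
 * callbacks are invoked directly by the PM core).
 */
static int foo_resume_noirq(struct device *dev)
{
	/* Leave the device alone if the core decided it can stay suspended. */
	if (dev_pm_skip_resume(dev))
		return 0;

	/* ... full "noirq" re-initialization of the device would go here ... */
	return 0;
}
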
+ +If the ``DPM_FLAG_MAY_SKIP_RESUME`` flag is not set for a device, but +``DPM_FLAG_SMART_SUSPEND`` is set and the driver's "late" and "noirq" suspend +callbacks are skipped, its system-wide "noirq" and "early" resume callbacks, if +present, are invoked as usual and the device's runtime PM status is set to +"active" by the PM core before enabling runtime PM for it. In that case, the +driver must be prepared to cope with the invocation of its system-wide resume +callbacks back-to-back with its ``->runtime_suspend`` one (without the +intervening ``->runtime_resume`` and system-wide suspend callbacks) and the +final state of the device must reflect the "active" runtime PM status in that +case. [Note that this is not a problem at all if the driver's +``->suspend_late`` callback pointer points to the same function as its +``->runtime_suspend`` one and its ``->resume_early`` callback pointer points to +the same function as the ``->runtime_resume`` one, while none of the other +system-wide suspend-resume callbacks of the driver are present, for example.] + +Likewise, if ``DPM_FLAG_MAY_SKIP_RESUME`` is set for a device, its driver's +system-wide "noirq" and "early" resume callbacks may be skipped while its "late" +and "noirq" suspend callbacks may have been executed (in principle, regardless +of whether or not ``DPM_FLAG_SMART_SUSPEND`` is set). In that case, the driver +needs to be able to cope with the invocation of its ``->runtime_resume`` +callback back-to-back with its "late" and "noirq" suspend ones. [For instance, +that is not a concern if the driver sets both ``DPM_FLAG_SMART_SUSPEND`` and +``DPM_FLAG_MAY_SKIP_RESUME`` and uses the same pair of suspend/resume callback +functions for runtime PM and system-wide suspend/resume.] diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index f09b382b4621..1831e431f725 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1010,32 +1010,33 @@ if the device is in runtime suspend when the system suspend starts. That also affects all of the ancestors of the device, so this flag should only be used if absolutely necessary. -The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a -positive value from pci_pm_prepare() if the ->prepare callback provided by the +The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive +value from pci_pm_prepare() only if the ->prepare callback provided by the driver of the device returns a positive value. That allows the driver to opt -out from using the direct-complete mechanism dynamically. +out from using the direct-complete mechanism dynamically (whereas setting +DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out). The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's perspective the device can be safely left in runtime suspend during system suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() -to skip resuming the device from runtime suspend unless there are PCI-specific -reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), -pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early -if the device remains in runtime suspend in the beginning of the "late" phase -of the system-wide transition under way. 
Moreover, if the device is in -runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime -power management status will be changed to "active" (as it is going to be put -into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), -the function will set the power.direct_complete flag for it (to make the PM core -skip the subsequent "thaw" callbacks for it) and return. +to avoid resuming the device from runtime suspend unless there are PCI-specific +reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and +pci_pm_poweroff_late/noirq() to return early if the device remains in runtime +suspend during the "late" phase of the system-wide transition under way. +Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or +pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it +is going to be put into D0 going forward). -Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver prefers the -device to be left in suspend after system-wide transitions to the working state. -This flag is checked by the PM core, but the PCI bus type informs the PM core -which devices may be left in suspend from its perspective (that happens during -the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() and pci_pm_resume_early() upfront. +Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its +"noirq" and "early" resume callbacks to be skipped if the device can be left +in suspend after a system-wide transition into the working state. This flag is +taken into consideration by the PM core along with the power.may_skip_resume +status bit of the device which is set by pci_pm_suspend_noirq() in certain +situations. If the PM core determines that the driver's "noirq" and "early" +resume callbacks should be skipped, the dev_pm_skip_resume() helper function +will return "true" and that will cause pci_pm_resume_noirq() and +pci_pm_resume_early() to return upfront without touching the device and +executing the driver callbacks. 3.2. Device Runtime Power Management ------------------------------------ diff --git a/include/linux/pm.h b/include/linux/pm.h index 35796fc49e7a..121c104a4090 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -545,25 +545,11 @@ struct pm_subsys_data { * cleared by the drivers as the driver core will take care of that. * * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. - * SMART_PREPARE: Check the return value of the driver's ->prepare callback. - * SMART_SUSPEND: No need to resume the device from runtime suspend. - * MAY_SKIP_RESUME: Avoid resuming the device during system resume if possible. + * SMART_PREPARE: Take the driver ->prepare callback return value into account. + * SMART_SUSPEND: Avoid resuming the device from runtime suspend. + * MAY_SKIP_RESUME: Allow driver "noirq" and "early" callbacks to be skipped. * - * Setting SMART_PREPARE instructs bus types and PM domains which may want - * system suspend/resume callbacks to be skipped for the device to return 0 from - * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in - * other words, the system suspend/resume callbacks can only be skipped for the - * device if its driver doesn't object against that). This flag has no effect - * if NO_DIRECT_COMPLETE is set. 
- * - * Setting SMART_SUSPEND instructs bus types and PM domains which may want to - * runtime resume the device upfront during system suspend that doing so is not - * necessary from the driver's perspective. It also may cause them to skip - * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by - * the driver if they decide to leave the device in runtime suspend. - * - * Setting MAY_SKIP_RESUME informs the PM core and middle-layer code that the - * driver prefers the device to be left in suspend after system resume. + * See Documentation/driver-api/pm/devices.rst for details. */ #define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) From 598cc93005636e32a98ed003b0158329827c0ccb Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Sat, 25 Apr 2020 16:35:40 -0400 Subject: [PATCH 11/51] PM: sleep: Helpful edits for devices.rst documentation Here are some minor edits of the devices.rst documentation file, intended to improve the clarity and add a couple of missing details. Signed-off-by: Alan Stern Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst | 94 ++++++++++++++----------- 1 file changed, 52 insertions(+), 42 deletions(-) diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 782cb37073a3..946ad0b94e31 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -349,7 +349,7 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. PM core will skip the ``suspend``, ``suspend_late`` and ``suspend_noirq`` phases as well as all of the corresponding phases of the subsequent device resume for all of these devices. In that case, - the ``->complete`` callback will be invoked directly after the + the ``->complete`` callback will be the next one invoked after the ``->prepare`` callback and is entirely responsible for putting the device into a consistent state as appropriate. @@ -383,11 +383,15 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. ``->suspend`` methods provided by subsystems (bus types and PM domains in particular) must follow an additional rule regarding what can be done to the devices before their drivers' ``->suspend`` methods are called. - Namely, they can only resume the devices from runtime suspend by - calling :c:func:`pm_runtime_resume` for them, if that is necessary, and + Namely, they may resume the devices from runtime suspend by + calling :c:func:`pm_runtime_resume` for them, if that is necessary, but they must not update the state of the devices in any other way at that time (in case the drivers need to resume the devices from runtime - suspend in their ``->suspend`` methods). + suspend in their ``->suspend`` methods). In fact, the PM core prevents + subsystems or drivers from putting devices into runtime suspend at + these times by calling :c:func:`pm_runtime_get_noresume` before issuing + the ``->prepare`` callback (and calling :c:func:`pm_runtime_put` after + issuing the ``->complete`` callback). 3. For a number of devices it is convenient to split suspend into the "quiesce device" and "save device state" phases, in which cases @@ -459,22 +463,22 @@ When resuming from freeze, standby or memory sleep, the phases are: Note, however, that new children may be registered below the device as soon as the ``->resume`` callbacks occur; it's not necessary to wait - until the ``complete`` phase with that. + until the ``complete`` phase runs. 
Moreover, if the preceding ``->prepare`` callback returned a positive number, the device may have been left in runtime suspend throughout the - whole system suspend and resume (the ``suspend``, ``suspend_late``, - ``suspend_noirq`` phases of system suspend and the ``resume_noirq``, - ``resume_early``, ``resume`` phases of system resume may have been - skipped for it). In that case, the ``->complete`` callback is entirely + whole system suspend and resume (its ``->suspend``, ``->suspend_late``, + ``->suspend_noirq``, ``->resume_noirq``, + ``->resume_early``, and ``->resume`` callbacks may have been + skipped). In that case, the ``->complete`` callback is entirely responsible for putting the device into a consistent state after system suspend if necessary. [For example, it may need to queue up a runtime resume request for the device for this purpose.] To check if that is the case, the ``->complete`` callback can consult the device's - ``power.direct_complete`` flag. Namely, if that flag is set when the - ``->complete`` callback is being run, it has been called directly after - the preceding ``->prepare`` and special actions may be required - to make the device work correctly afterward. + ``power.direct_complete`` flag. If that flag is set when the + ``->complete`` callback is being run then the direct-complete mechanism + was used, and special actions may be required to make the device work + correctly afterward. At the end of these phases, drivers should be as functional as they were before suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are @@ -575,10 +579,12 @@ and the phases are similar. The ``->poweroff``, ``->poweroff_late`` and ``->poweroff_noirq`` callbacks should do essentially the same things as the ``->suspend``, ``->suspend_late`` -and ``->suspend_noirq`` callbacks, respectively. The only notable difference is +and ``->suspend_noirq`` callbacks, respectively. A notable difference is that they need not store the device register values, because the registers should already have been stored during the ``freeze``, ``freeze_late`` or -``freeze_noirq`` phases. +``freeze_noirq`` phases. Also, on many machines the firmware will power-down +the entire system, so it is not necessary for the callback to put the device in +a low-power state. Leaving Hibernation @@ -764,11 +770,10 @@ device driver in question. If it is necessary to resume a device from runtime suspend during a system-wide transition into a sleep state, that can be done by calling -:c:func:`pm_runtime_resume` for it from the ``->suspend`` callback (or its -couterpart for transitions related to hibernation) of either the device's driver -or a subsystem responsible for it (for example, a bus type or a PM domain). -That is guaranteed to work by the requirement that subsystems must not change -the state of devices (possibly except for resuming them from runtime suspend) +:c:func:`pm_runtime_resume` from the ``->suspend`` callback (or the ``->freeze`` +or ``->poweroff`` callback for transitions related to hibernation) of either the +device's driver or its subsystem (for example, a bus type or a PM domain). +However, subsystems must not otherwise change the runtime status of devices from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). 
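To ground the ``->complete`` handling described above, a minimal hypothetical sketch (a "foo" driver, not code from this series) of a ``->complete`` callback that consults ``power.direct_complete`` and queues a runtime resume when the direct-complete path was taken might look like this:

#include <linux/pm.h>
#include <linux/pm_runtime.h>

/*
 * Hypothetical ->complete callback of a driver whose ->prepare may return
 * a positive value (that is, one opting in to direct-complete).
 */
static void foo_complete(struct device *dev)
{
	/*
	 * If direct-complete was used, none of the other suspend/resume
	 * callbacks ran, so ask runtime PM to bring the device back to a
	 * consistent state asynchronously if that is needed.
	 */
	if (dev->power.direct_complete)
		pm_request_resume(dev);
}
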
@@ -779,27 +784,29 @@ The ``DPM_FLAG_SMART_SUSPEND`` Driver Flag Some bus types and PM domains have a policy to resume all devices from runtime suspend upfront in their ``->suspend`` callbacks, but that may not be really -necessary if the driver of the device can cope with runtime-suspended devices. -The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in -:c:member:`power.driver_flags` at the probe time with the help of the +necessary if the device's driver can cope with runtime-suspended devices. +The driver can indicate this by setting ``DPM_FLAG_SMART_SUSPEND`` in +:c:member:`power.driver_flags` at probe time, with the assistance of the :c:func:`dev_pm_set_driver_flags` helper routine. -However, setting that flag also causes the PM core and middle-layer code +Setting that flag causes the PM core and middle-layer code (bus types, PM domains etc.) to skip the ``->suspend_late`` and ``->suspend_noirq`` callbacks provided by the driver if the device remains in -runtime suspend during the ``suspend_late`` phase of system-wide suspend (or -during the ``poweroff_late`` or ``freeze_late`` phase of hibernation), -after runtime PM was disabled for it. [Without doing that, the same driver +runtime suspend throughout those phases of the system-wide suspend (and +similarly for the "freeze" and "poweroff" parts of system hibernation). +[Otherwise the same driver callback might be executed twice in a row for the same device, which would not be valid in general.] If the middle-layer system-wide PM callbacks are present -for the device, they are responsible for doing the above, and the PM core takes -care of it otherwise. +for the device then they are responsible for skipping these driver callbacks; +if not then the PM core skips them. The subsystem callback routines can +determine whether they need to skip the driver callbacks by testing the return +value from the :c:func:`dev_pm_skip_suspend` helper function. -In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_late`` -and ``->thaw_noirq`` callbacks are skipped if the device remained in runtime -suspend during the preceding "freeze" transition related to hibernation. -Again, if the middle-layer callbacks are present for the device, they are -responsible for doing that, or the PM core takes care of it otherwise. +In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_noirq`` +and ``->thaw_early`` callbacks are skipped in hibernation if the device remained +in runtime suspend throughout the preceding "freeze" transition. Again, if the +middle-layer callbacks are present for the device, they are responsible for +doing this, otherwise the PM core takes care of it. The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag @@ -820,17 +827,20 @@ indicate to the PM core and middle-layer code that they allow their "noirq" and after system-wide PM transitions to the working state. Whether or not that is the case generally depends on the state of the device before the given system suspend-resume cycle and on the type of the system transition under way. -In particular, the "restore" and "thaw" transitions related to hibernation are -not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All devices are always -resumed during the "restore" transition and whether or not any driver callbacks -are skipped during the "freeze" transition depends whether or not the -``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_).] 
+In particular, the "thaw" and "restore" transitions related to hibernation are +not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All callbacks are +issued during the "restore" transition regardless of the flag settings, +and whether or not any driver callbacks +are skipped during the "thaw" transition depends whether or not the +``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_). +In addition, a device is not allowed to remain in runtime suspend if any of its +children will be returned to full power.] The ``DPM_FLAG_MAY_SKIP_RESUME`` flag is taken into account in combination with the :c:member:`power.may_skip_resume` status bit set by the PM core during the "suspend" phase of suspend-type transitions. If the driver or the middle layer has a reason to prevent the driver's "noirq" and "early" resume callbacks from -being skipped during the subsequent resume transition of the system, it should +being skipped during the subsequent system resume transition, it should clear :c:member:`power.may_skip_resume` in its ``->suspend``, ``->suspend_late`` or ``->suspend_noirq`` callback. [Note that the drivers setting ``DPM_FLAG_SMART_SUSPEND`` need to clear :c:member:`power.may_skip_resume` in @@ -845,8 +855,8 @@ not they should be skipped can be determined by evaluating the If that function returns ``true``, the driver's "noirq" and "early" resume callbacks should be skipped and the device's runtime PM status will be set to "suspended" by the PM core. Otherwise, if the device was runtime-suspended -during the preceding system-wide suspend transition and -``DPM_FLAG_SMART_SUSPEND`` is set for it, its runtime PM status will be set to +during the preceding system-wide suspend transition and its +``DPM_FLAG_SMART_SUSPEND`` is set, its runtime PM status will be set to "active" by the PM core. [Hence, the drivers that do not set ``DPM_FLAG_SMART_SUSPEND`` should not expect the runtime PM status of their devices to be changed from "suspended" to "active" by the PM core during From 59b55c1f204608cdc2a767c8c969d9ccfd820ec7 Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Wed, 15 Apr 2020 10:15:59 +0200 Subject: [PATCH 12/51] cpufreq: omap: Build driver by default for ARCH_OMAP2PLUS When building the mult_v7_defconfig, ARM_TI_CPUFREQ doesn't get enabled evenwhen ARCH_OMAP(3|4) is selected. Build ARM_TI_CPUFREQ by default for ARCH_OMAP2PLUS. Signed-off-by: Anders Roxell Signed-off-by: Viresh Kumar --- drivers/cpufreq/Kconfig.arm | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm index 15c1a1231516..9481292981f0 100644 --- a/drivers/cpufreq/Kconfig.arm +++ b/drivers/cpufreq/Kconfig.arm @@ -317,6 +317,7 @@ config ARM_TEGRA186_CPUFREQ config ARM_TI_CPUFREQ bool "Texas Instruments CPUFreq support" depends on ARCH_OMAP2PLUS + default ARCH_OMAP2PLUS help This driver enables valid OPPs on the running platform based on values contained within the SoC in use. Enable this in order to From a08e1b6c2d0b9ab7889ac1f7b5c535affbe5032e Mon Sep 17 00:00:00 2001 From: Peng Fan Date: Mon, 20 Apr 2020 15:55:13 +0800 Subject: [PATCH 13/51] cpufreq: Add i.MX7ULP to cpufreq-dt-platdev blacklist Add i.MX7ULP to cpufreq-dt-platdev blacklist. 
Signed-off-by: Peng Fan Signed-off-by: Viresh Kumar --- drivers/cpufreq/cpufreq-dt-platdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c index cb9db16bea61..5c8baf603e05 100644 --- a/drivers/cpufreq/cpufreq-dt-platdev.c +++ b/drivers/cpufreq/cpufreq-dt-platdev.c @@ -105,6 +105,7 @@ static const struct of_device_id blacklist[] __initconst = { { .compatible = "calxeda,highbank", }, { .compatible = "calxeda,ecx-2000", }, + { .compatible = "fsl,imx7ulp", }, { .compatible = "fsl,imx7d", }, { .compatible = "fsl,imx8mq", }, { .compatible = "fsl,imx8mm", }, From a6d1bfa05545b0d34f5b5093248b10a745c050e3 Mon Sep 17 00:00:00 2001 From: Lad Prabhakar Date: Mon, 27 Apr 2020 13:53:30 +0100 Subject: [PATCH 14/51] cpufreq: dt: Add support for r8a7742 Add the compatible strings for supporting the generic cpufreq driver on the Renesas RZ/G1H (R8A7742) SoC. Signed-off-by: Lad Prabhakar Reviewed-by: Marian-Cristian Rotariu Reviewed-by: Geert Uytterhoeven Signed-off-by: Viresh Kumar --- drivers/cpufreq/cpufreq-dt-platdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c index 5c8baf603e05..e8e20fef400b 100644 --- a/drivers/cpufreq/cpufreq-dt-platdev.c +++ b/drivers/cpufreq/cpufreq-dt-platdev.c @@ -53,6 +53,7 @@ static const struct of_device_id whitelist[] __initconst = { { .compatible = "renesas,r7s72100", }, { .compatible = "renesas,r8a73a4", }, { .compatible = "renesas,r8a7740", }, + { .compatible = "renesas,r8a7742", }, { .compatible = "renesas,r8a7743", }, { .compatible = "renesas,r8a7744", }, { .compatible = "renesas,r8a7745", }, From 7c2553f0db6133ba079597422391661914ce91c7 Mon Sep 17 00:00:00 2001 From: Peng Fan Date: Tue, 28 Apr 2020 15:21:00 +0800 Subject: [PATCH 15/51] cpufreq: imx-cpufreq-dt: support i.MX7ULP i.MX7ULP's ARM core clock design is totally different compared with the i.MX7D/8M SoCs which are supported by imx-cpufreq-dt. It needs get_intermediate and target_intermediate to get the clk MUXes ready before letting OPP configure the ARM core clk. |---FIRC |------RUN---...---SCS(MUX2) --------| ARM --(MUX1) |---SPLL_PFD0(CLK_SET_RATE_GATE) |------HSRUN--...--HSRUN_SCS(MUX3)---| |---SRIC FIRC is the step clk, SPLL_PFD0 is the normal clk driving the ARM core. MUX2 and MUX3 share the same inputs. So if MUX2/MUX3 both source from SPLL_PFD0, both MUXes will lose their input when SPLL_PFD0 is reconfigured. So target_intermediate() switches MUX2/MUX3 to FIRC, to avoid the ARM core losing its clock while SPLL_PFD0 is being configured.
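For context, the cpufreq core drives these two callbacks around ->target_index(): if ->get_intermediate() returns a non-zero frequency, ->target_intermediate() is invoked first to park the CPU on that stable intermediate clock. A simplified illustration of that call order follows (notifications and error recovery omitted; a sketch, not the actual code in drivers/cpufreq/cpufreq.c):

#include <linux/cpufreq.h>

/*
 * Illustrative only: simplified version of the sequencing the cpufreq core
 * performs when a driver provides get_intermediate()/target_intermediate().
 */
static int example_switch_frequency(struct cpufreq_driver *drv,
				    struct cpufreq_policy *policy,
				    unsigned int index)
{
	unsigned int intermediate_freq = 0;
	int ret;

	if (drv->get_intermediate) {
		intermediate_freq = drv->get_intermediate(policy, index);
		if (intermediate_freq) {
			/* e.g. i.MX7ULP: re-parent the ARM clk MUXes to FIRC */
			ret = drv->target_intermediate(policy, index);
			if (ret)
				return ret;
		}
	}

	/* Only now is the final frequency for 'index' programmed. */
	return drv->target_index(policy, index);
}
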
Signed-off-by: Peng Fan Signed-off-by: Viresh Kumar --- drivers/cpufreq/imx-cpufreq-dt.c | 84 +++++++++++++++++++++++++++++++- 1 file changed, 82 insertions(+), 2 deletions(-) diff --git a/drivers/cpufreq/imx-cpufreq-dt.c b/drivers/cpufreq/imx-cpufreq-dt.c index de206d2745fe..3fe9125156b4 100644 --- a/drivers/cpufreq/imx-cpufreq-dt.c +++ b/drivers/cpufreq/imx-cpufreq-dt.c @@ -3,7 +3,9 @@ * Copyright 2019 NXP */ +#include #include +#include #include #include #include @@ -12,8 +14,11 @@ #include #include #include +#include #include +#include "cpufreq-dt.h" + #define OCOTP_CFG3_SPEED_GRADE_SHIFT 8 #define OCOTP_CFG3_SPEED_GRADE_MASK (0x3 << 8) #define IMX8MN_OCOTP_CFG3_SPEED_GRADE_MASK (0xf << 8) @@ -22,20 +27,92 @@ #define IMX8MP_OCOTP_CFG3_MKT_SEGMENT_SHIFT 5 #define IMX8MP_OCOTP_CFG3_MKT_SEGMENT_MASK (0x3 << 5) +#define IMX7ULP_MAX_RUN_FREQ 528000 + /* cpufreq-dt device registered by imx-cpufreq-dt */ static struct platform_device *cpufreq_dt_pdev; static struct opp_table *cpufreq_opp_table; +static struct device *cpu_dev; + +enum IMX7ULP_CPUFREQ_CLKS { + ARM, + CORE, + SCS_SEL, + HSRUN_CORE, + HSRUN_SCS_SEL, + FIRC, +}; + +static struct clk_bulk_data imx7ulp_clks[] = { + { .id = "arm" }, + { .id = "core" }, + { .id = "scs_sel" }, + { .id = "hsrun_core" }, + { .id = "hsrun_scs_sel" }, + { .id = "firc" }, +}; + +static unsigned int imx7ulp_get_intermediate(struct cpufreq_policy *policy, + unsigned int index) +{ + return clk_get_rate(imx7ulp_clks[FIRC].clk); +} + +static int imx7ulp_target_intermediate(struct cpufreq_policy *policy, + unsigned int index) +{ + unsigned int newfreq = policy->freq_table[index].frequency; + + clk_set_parent(imx7ulp_clks[SCS_SEL].clk, imx7ulp_clks[FIRC].clk); + clk_set_parent(imx7ulp_clks[HSRUN_SCS_SEL].clk, imx7ulp_clks[FIRC].clk); + + if (newfreq > IMX7ULP_MAX_RUN_FREQ) + clk_set_parent(imx7ulp_clks[ARM].clk, + imx7ulp_clks[HSRUN_CORE].clk); + else + clk_set_parent(imx7ulp_clks[ARM].clk, imx7ulp_clks[CORE].clk); + + return 0; +} + +static struct cpufreq_dt_platform_data imx7ulp_data = { + .target_intermediate = imx7ulp_target_intermediate, + .get_intermediate = imx7ulp_get_intermediate, +}; static int imx_cpufreq_dt_probe(struct platform_device *pdev) { - struct device *cpu_dev = get_cpu_device(0); + struct platform_device *dt_pdev; u32 cell_value, supported_hw[2]; int speed_grade, mkt_segment; int ret; + cpu_dev = get_cpu_device(0); + if (!of_find_property(cpu_dev->of_node, "cpu-supply", NULL)) return -ENODEV; + if (of_machine_is_compatible("fsl,imx7ulp")) { + ret = clk_bulk_get(cpu_dev, ARRAY_SIZE(imx7ulp_clks), + imx7ulp_clks); + if (ret) + return ret; + + dt_pdev = platform_device_register_data(NULL, "cpufreq-dt", + -1, &imx7ulp_data, + sizeof(imx7ulp_data)); + if (IS_ERR(dt_pdev)) { + clk_bulk_put(ARRAY_SIZE(imx7ulp_clks), imx7ulp_clks); + ret = PTR_ERR(dt_pdev); + dev_err(&pdev->dev, "Failed to register cpufreq-dt: %d\n", ret); + return ret; + } + + cpufreq_dt_pdev = dt_pdev; + + return 0; + } + ret = nvmem_cell_read_u32(cpu_dev, "speed_grade", &cell_value); if (ret) return ret; @@ -98,7 +175,10 @@ static int imx_cpufreq_dt_probe(struct platform_device *pdev) static int imx_cpufreq_dt_remove(struct platform_device *pdev) { platform_device_unregister(cpufreq_dt_pdev); - dev_pm_opp_put_supported_hw(cpufreq_opp_table); + if (!of_machine_is_compatible("fsl,imx7ulp")) + dev_pm_opp_put_supported_hw(cpufreq_opp_table); + else + clk_bulk_put(ARRAY_SIZE(imx7ulp_clks), imx7ulp_clks); return 0; } From 2f516e7cbe88f05023b6cc458d3a22b7dc56af99 Mon Sep 17 00:00:00 2001 
From: Hanjun Guo Date: Mon, 27 Apr 2020 17:34:20 +0800 Subject: [PATCH 16/51] cpuidle: sysfs: Remove the unused define_one_r(o/w) macros The define_one_ro and define_one_rw macros are not used, remove it. Signed-off-by: Hanjun Guo Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index cdeedbf02646..7729cf622d1e 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -167,11 +167,6 @@ struct cpuidle_attr { ssize_t (*store)(struct cpuidle_device *, const char *, size_t count); }; -#define define_one_ro(_name, show) \ - static struct cpuidle_attr attr_##_name = __ATTR(_name, 0444, show, NULL) -#define define_one_rw(_name, show, store) \ - static struct cpuidle_attr attr_##_name = __ATTR(_name, 0644, show, store) - #define attr_to_cpuidleattr(a) container_of(a, struct cpuidle_attr, attr) struct cpuidle_device_kobj { From eba933ceebf212127c9aa1c87a162867af9cf781 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Mon, 27 Apr 2020 17:34:21 +0800 Subject: [PATCH 17/51] cpuidle: sysfs: Minor coding style corrections Fix two minor coding style issues. Signed-off-by: Hanjun Guo Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index 7729cf622d1e..d3ef1d7ad6ee 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -426,12 +426,12 @@ static inline void cpuidle_remove_s2idle_attr_group(struct cpuidle_state_kobj *k #define attr_to_stateattr(a) container_of(a, struct cpuidle_state_attr, attr) static ssize_t cpuidle_state_show(struct kobject *kobj, struct attribute *attr, - char * buf) + char *buf) { int ret = -EIO; struct cpuidle_state *state = kobj_to_state(kobj); struct cpuidle_state_usage *state_usage = kobj_to_state_usage(kobj); - struct cpuidle_state_attr * cattr = attr_to_stateattr(attr); + struct cpuidle_state_attr *cattr = attr_to_stateattr(attr); if (cattr->show) ret = cattr->show(state, state_usage, buf); From 2dea651680cea1f3a29925de51002f33d1f55711 Mon Sep 17 00:00:00 2001 From: Ansuel Smith Date: Fri, 1 May 2020 00:22:25 +0200 Subject: [PATCH 18/51] cpufreq: qcom: fix wrong compatible binding Binding in Documentation is still "operating-points-v2-kryo-cpu". Restore the old binding to fix the compatibility problem. Fixes: a8811ec764f9 ("cpufreq: qcom: Add support for krait based socs") Signed-off-by: Ansuel Smith Signed-off-by: Viresh Kumar --- drivers/cpufreq/qcom-cpufreq-nvmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c index a1b8238872a2..d06b37822c3d 100644 --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c @@ -277,7 +277,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev) if (!np) return -ENOENT; - ret = of_device_is_compatible(np, "operating-points-v2-qcom-cpu"); + ret = of_device_is_compatible(np, "operating-points-v2-kryo-cpu"); if (!ret) { of_node_put(np); return -ENOENT; From 157f527639da13c45b87adfc59ca6f6d354b8530 Mon Sep 17 00:00:00 2001 From: Mian Yousaf Kaukab Date: Tue, 21 Apr 2020 10:29:59 +0200 Subject: [PATCH 19/51] cpufreq: qoriq: convert to a platform driver The driver has to be manually loaded if it is built as a module. It is neither exporting MODULE_DEVICE_TABLE nor MODULE_ALIAS. 
Moreover, no platform-device is created (and thus no uevent is sent) for the clockgen nodes it depends on. Convert the module to a platform driver with its own alias. Moreover, drop whitelisted SOCs. Platform device will be created only for the compatible platforms. Reviewed-by: Yuantian Tang Acked-by: Viresh Kumar Signed-off-by: Mian Yousaf Kaukab Signed-off-by: Viresh Kumar --- drivers/cpufreq/qoriq-cpufreq.c | 78 +++++++++++++-------------------- 1 file changed, 30 insertions(+), 48 deletions(-) diff --git a/drivers/cpufreq/qoriq-cpufreq.c b/drivers/cpufreq/qoriq-cpufreq.c index 8e436dc75c8b..6b6b20da2bcf 100644 --- a/drivers/cpufreq/qoriq-cpufreq.c +++ b/drivers/cpufreq/qoriq-cpufreq.c @@ -18,6 +18,7 @@ #include #include #include +#include /** * struct cpu_data @@ -29,12 +30,6 @@ struct cpu_data { struct cpufreq_frequency_table *table; }; -/* - * Don't use cpufreq on this SoC -- used when the SoC would have otherwise - * matched a more generic compatible. - */ -#define SOC_BLACKLIST 1 - /** * struct soc_data - SoC specific data * @flags: SOC_xxx @@ -264,64 +259,51 @@ static struct cpufreq_driver qoriq_cpufreq_driver = { .attr = cpufreq_generic_attr, }; -static const struct soc_data blacklist = { - .flags = SOC_BLACKLIST, -}; - -static const struct of_device_id node_matches[] __initconst = { +static const struct of_device_id qoriq_cpufreq_blacklist[] = { /* e6500 cannot use cpufreq due to erratum A-008083 */ - { .compatible = "fsl,b4420-clockgen", &blacklist }, - { .compatible = "fsl,b4860-clockgen", &blacklist }, - { .compatible = "fsl,t2080-clockgen", &blacklist }, - { .compatible = "fsl,t4240-clockgen", &blacklist }, - - { .compatible = "fsl,ls1012a-clockgen", }, - { .compatible = "fsl,ls1021a-clockgen", }, - { .compatible = "fsl,ls1028a-clockgen", }, - { .compatible = "fsl,ls1043a-clockgen", }, - { .compatible = "fsl,ls1046a-clockgen", }, - { .compatible = "fsl,ls1088a-clockgen", }, - { .compatible = "fsl,ls2080a-clockgen", }, - { .compatible = "fsl,lx2160a-clockgen", }, - { .compatible = "fsl,p4080-clockgen", }, - { .compatible = "fsl,qoriq-clockgen-1.0", }, - { .compatible = "fsl,qoriq-clockgen-2.0", }, + { .compatible = "fsl,b4420-clockgen", }, + { .compatible = "fsl,b4860-clockgen", }, + { .compatible = "fsl,t2080-clockgen", }, + { .compatible = "fsl,t4240-clockgen", }, {} }; -static int __init qoriq_cpufreq_init(void) +static int qoriq_cpufreq_probe(struct platform_device *pdev) { int ret; - struct device_node *np; - const struct of_device_id *match; - const struct soc_data *data; + struct device_node *np; - np = of_find_matching_node(NULL, node_matches); - if (!np) - return -ENODEV; - - match = of_match_node(node_matches, np); - data = match->data; - - of_node_put(np); - - if (data && data->flags & SOC_BLACKLIST) + np = of_find_matching_node(NULL, qoriq_cpufreq_blacklist); + if (np) { + dev_info(&pdev->dev, "Disabling due to erratum A-008083"); return -ENODEV; + } ret = cpufreq_register_driver(&qoriq_cpufreq_driver); - if (!ret) - pr_info("Freescale QorIQ CPU frequency scaling driver\n"); + if (ret) + return ret; - return ret; + dev_info(&pdev->dev, "Freescale QorIQ CPU frequency scaling driver\n"); + return 0; } -module_init(qoriq_cpufreq_init); -static void __exit qoriq_cpufreq_exit(void) +static int qoriq_cpufreq_remove(struct platform_device *pdev) { cpufreq_unregister_driver(&qoriq_cpufreq_driver); -} -module_exit(qoriq_cpufreq_exit); + return 0; +} + +static struct platform_driver qoriq_cpufreq_platform_driver = { + .driver = { + .name = "qoriq-cpufreq", + }, + .probe = 
qoriq_cpufreq_probe, + .remove = qoriq_cpufreq_remove, +}; +module_platform_driver(qoriq_cpufreq_platform_driver); + +MODULE_ALIAS("platform:qoriq-cpufreq"); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Tang Yuantian "); MODULE_DESCRIPTION("cpufreq driver for Freescale QorIQ series SoCs"); From cf1e0449ac478e419daf3c3f03721878fe7fa2be Mon Sep 17 00:00:00 2001 From: Mian Yousaf Kaukab Date: Tue, 21 Apr 2020 10:30:00 +0200 Subject: [PATCH 20/51] clk: qoriq: add cpufreq platform device Add a platform device for qoirq-cpufreq driver for the compatible clockgen blocks. Reviewed-by: Yuantian Tang Acked-by: Viresh Kumar Signed-off-by: Mian Yousaf Kaukab Acked-by: Stephen Boyd Signed-off-by: Viresh Kumar --- drivers/clk/clk-qoriq.c | 30 +++++++++++++++++++++++++++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c index d5946f7486d6..374afcab89af 100644 --- a/drivers/clk/clk-qoriq.c +++ b/drivers/clk/clk-qoriq.c @@ -95,6 +95,7 @@ struct clockgen { }; static struct clockgen clockgen; +static bool add_cpufreq_dev __initdata; static void cg_out(struct clockgen *cg, u32 val, u32 __iomem *reg) { @@ -1019,7 +1020,7 @@ static void __init create_muxes(struct clockgen *cg) } } -static void __init clockgen_init(struct device_node *np); +static void __init _clockgen_init(struct device_node *np, bool legacy); /* * Legacy nodes may get probed before the parent clockgen node. @@ -1030,7 +1031,7 @@ static void __init clockgen_init(struct device_node *np); static void __init legacy_init_clockgen(struct device_node *np) { if (!clockgen.node) - clockgen_init(of_get_parent(np)); + _clockgen_init(of_get_parent(np), true); } /* Legacy node */ @@ -1447,7 +1448,7 @@ static bool __init has_erratum_a4510(void) } #endif -static void __init clockgen_init(struct device_node *np) +static void __init _clockgen_init(struct device_node *np, bool legacy) { int i, ret; bool is_old_ls1021a = false; @@ -1516,12 +1517,35 @@ static void __init clockgen_init(struct device_node *np) __func__, np, ret); } + /* Don't create cpufreq device for legacy clockgen blocks */ + add_cpufreq_dev = !legacy; + return; err: iounmap(clockgen.regs); clockgen.regs = NULL; } +static void __init clockgen_init(struct device_node *np) +{ + _clockgen_init(np, false); +} + +static int __init clockgen_cpufreq_init(void) +{ + struct platform_device *pdev; + + if (add_cpufreq_dev) { + pdev = platform_device_register_simple("qoriq-cpufreq", -1, + NULL, 0); + if (IS_ERR(pdev)) + pr_err("Couldn't register qoriq-cpufreq err=%ld\n", + PTR_ERR(pdev)); + } + return 0; +} +device_initcall(clockgen_cpufreq_init); + CLK_OF_DECLARE(qoriq_clockgen_1, "fsl,qoriq-clockgen-1.0", clockgen_init); CLK_OF_DECLARE(qoriq_clockgen_2, "fsl,qoriq-clockgen-2.0", clockgen_init); CLK_OF_DECLARE(qoriq_clockgen_b4420, "fsl,b4420-clockgen", clockgen_init); From 1f1755af4f062cb1cbd55ca4a250fe272b82fe2f Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Thu, 7 May 2020 13:29:53 +0200 Subject: [PATCH 21/51] cpufreq: qoriq: Add platform dependencies The Freescale QorIQ clock controller is only present on Freescale E500MC and Layerscape SoCs. Add platform dependencies to the QORIQ_CPUFREQ config symbol, to avoid asking the user about it when configuring a kernel without E500MC or Layerscape support. 
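In general terms, the device/driver pairing set up by the two preceding qoriq patches works like the following hedged sketch (hypothetical "foo" names; the provider side would register the device with platform_device_register_simple("foo-cpufreq", -1, NULL, 0) from its initcall, as clk-qoriq now does for "qoriq-cpufreq"):

#include <linux/module.h>
#include <linux/platform_device.h>

/*
 * Consumer side: a platform driver matched by name against a device that
 * some other driver registered with platform_device_register_simple().
 */
static int foo_cpufreq_probe(struct platform_device *pdev)
{
	dev_info(&pdev->dev, "probed via platform bus match\n");
	return 0;
}

static struct platform_driver foo_cpufreq_driver = {
	.driver = {
		.name = "foo-cpufreq",
	},
	.probe = foo_cpufreq_probe,
};
module_platform_driver(foo_cpufreq_driver);

/*
 * The device's uevent carries MODALIAS=platform:foo-cpufreq, so this alias
 * is what lets udev auto-load the module when the device appears.
 */
MODULE_ALIAS("platform:foo-cpufreq");
MODULE_LICENSE("GPL");
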
Signed-off-by: Geert Uytterhoeven Acked-by: Arnd Bergmann Acked-by: Li Yang Signed-off-by: Viresh Kumar --- drivers/cpufreq/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index c3e6bd59e920..e91750132552 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -323,7 +323,8 @@ endif config QORIQ_CPUFREQ tristate "CPU frequency scaling driver for Freescale QorIQ SoCs" - depends on OF && COMMON_CLK && (PPC_E500MC || ARM || ARM64) + depends on OF && COMMON_CLK + depends on PPC_E500MC || SOC_LS1021A || ARCH_LAYERSCAPE || COMPILE_TEST select CLK_QORIQ help This adds the CPUFreq driver support for Freescale QorIQ SoCs From 7b0bf99b9ee497cc0f079472566aff716d033d43 Mon Sep 17 00:00:00 2001 From: Zou Wei Date: Tue, 28 Apr 2020 17:43:15 +0800 Subject: [PATCH 22/51] cpupower: Remove unneeded semicolon Fixes coccicheck warnings: tools/power/cpupower/utils/cpupower-info.c:65:2-3: Unneeded semicolon tools/power/cpupower/utils/cpupower-set.c:75:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c:120:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c:175:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c:56:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c:75:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c:82:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/nhm_idle.c:94:2-3: Unneeded semicolon tools/power/cpupower/utils/idle_monitor/snb_idle.c:80:2-3: Unneeded semicolon Reported-by: Hulk Robot Signed-off-by: Zou Wei Signed-off-by: Shuah Khan --- tools/power/cpupower/utils/cpupower-info.c | 2 +- tools/power/cpupower/utils/cpupower-set.c | 2 +- tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c | 2 +- tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c | 6 +++--- tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c | 2 +- tools/power/cpupower/utils/idle_monitor/nhm_idle.c | 2 +- tools/power/cpupower/utils/idle_monitor/snb_idle.c | 2 +- 7 files changed, 9 insertions(+), 9 deletions(-) diff --git a/tools/power/cpupower/utils/cpupower-info.c b/tools/power/cpupower/utils/cpupower-info.c index d3755ea70d4d..0ba61a2c4d81 100644 --- a/tools/power/cpupower/utils/cpupower-info.c +++ b/tools/power/cpupower/utils/cpupower-info.c @@ -62,7 +62,7 @@ int cmd_info(int argc, char **argv) default: print_wrong_arg_exit(); } - }; + } if (!params.params) params.params = 0x7; diff --git a/tools/power/cpupower/utils/cpupower-set.c b/tools/power/cpupower/utils/cpupower-set.c index 3cca6f715dd9..052044d7e012 100644 --- a/tools/power/cpupower/utils/cpupower-set.c +++ b/tools/power/cpupower/utils/cpupower-set.c @@ -72,7 +72,7 @@ int cmd_set(int argc, char **argv) default: print_wrong_arg_exit(); } - }; + } if (!params.params) print_wrong_arg_exit(); diff --git a/tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c b/tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c index 20f46348271b..5edd35bd9ee9 100644 --- a/tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c +++ b/tools/power/cpupower/utils/idle_monitor/amd_fam14h_idle.c @@ -117,7 +117,7 @@ static int amd_fam14h_get_pci_info(struct cstate *state, break; default: return -1; - }; + } return 0; } diff --git a/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c b/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c index a65f7d011513..8b42c2f0a5b0 100644 --- 
a/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c +++ b/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c @@ -53,7 +53,7 @@ static int cpuidle_start(void) dprint("CPU %d - State: %d - Val: %llu\n", cpu, state, previous_count[cpu][state]); } - }; + } return 0; } @@ -72,7 +72,7 @@ static int cpuidle_stop(void) dprint("CPU %d - State: %d - Val: %llu\n", cpu, state, previous_count[cpu][state]); } - }; + } return 0; } @@ -172,7 +172,7 @@ static struct cpuidle_monitor *cpuidle_register(void) cpuidle_cstates[num].id = num; cpuidle_cstates[num].get_count_percent = cpuidle_get_count_percent; - }; + } /* Free this at program termination */ previous_count = malloc(sizeof(long long *) * cpu_count); diff --git a/tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c b/tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c index 97ad3233a521..55e55b6b42f9 100644 --- a/tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c +++ b/tools/power/cpupower/utils/idle_monitor/hsw_ext_idle.c @@ -79,7 +79,7 @@ static int hsw_ext_get_count(enum intel_hsw_ext_id id, unsigned long long *val, break; default: return -1; - }; + } if (read_msr(cpu, msr, val)) return -1; return 0; diff --git a/tools/power/cpupower/utils/idle_monitor/nhm_idle.c b/tools/power/cpupower/utils/idle_monitor/nhm_idle.c index 114271165182..16eaf006f61f 100644 --- a/tools/power/cpupower/utils/idle_monitor/nhm_idle.c +++ b/tools/power/cpupower/utils/idle_monitor/nhm_idle.c @@ -91,7 +91,7 @@ static int nhm_get_count(enum intel_nhm_id id, unsigned long long *val, break; default: return -1; - }; + } if (read_msr(cpu, msr, val)) return -1; diff --git a/tools/power/cpupower/utils/idle_monitor/snb_idle.c b/tools/power/cpupower/utils/idle_monitor/snb_idle.c index df8b223cc096..811d63ab17a7 100644 --- a/tools/power/cpupower/utils/idle_monitor/snb_idle.c +++ b/tools/power/cpupower/utils/idle_monitor/snb_idle.c @@ -77,7 +77,7 @@ static int snb_get_count(enum intel_snb_id id, unsigned long long *val, break; default: return -1; - }; + } if (read_msr(cpu, msr, val)) return -1; return 0; From 2909438d4d62681f392c57df4cd6b7183d19dde0 Mon Sep 17 00:00:00 2001 From: Wang Wenhu Date: Wed, 13 May 2020 07:18:54 -0700 Subject: [PATCH 23/51] cpufreq: fix minor typo in struct cpufreq_driver doc comment Delete the duplicate "to", possibly double-typed. Signed-off-by: Wang Wenhu Signed-off-by: Rafael J. Wysocki --- include/linux/cpufreq.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index f7240251a949..67d5950bd878 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -330,7 +330,7 @@ struct cpufreq_driver { * * get_intermediate should return a stable intermediate frequency * platform wants to switch to and target_intermediate() should set CPU - * to to that frequency, before jumping to the frequency corresponding + * to that frequency, before jumping to the frequency corresponding * to 'index'. Core will take care of sending notifications and driver * doesn't have to handle them in target_intermediate() or * target_index(). From 33c980036deb5ee9961db82401c0fbfb96f126b3 Mon Sep 17 00:00:00 2001 From: Jacob Pan Date: Fri, 15 May 2020 15:30:41 +0800 Subject: [PATCH 24/51] powercap/intel_rapl: add support for ElkhartLake Add intel_rapl support for ElkhartLake platform. Signed-off-by: Jacob Pan Signed-off-by: Zhang Rui Signed-off-by: Rafael J. 
Wysocki --- drivers/powercap/intel_rapl_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index eb328655bc01..c3e335e37c7d 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -989,6 +989,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = { X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_PLUS, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D, &rapl_defaults_core), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_D, &rapl_defaults_core), X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_L, &rapl_defaults_core), From 8b7ce5e49049ca78c238f03d70569a73da049f32 Mon Sep 17 00:00:00 2001 From: Ulf Hansson Date: Mon, 11 May 2020 15:33:46 +0200 Subject: [PATCH 25/51] cpuidle: psci: Fixup execution order when entering a domain idle state Moving forward, platforms are going to need to execute specific "last-man" operations before a domain idle state can be entered. In one way or the other, these operations needs to be triggered while walking the hierarchical topology via runtime PM and genpd, as it's at that point the last-man becomes known. Moreover, executing last-man operations needs to be done after the CPU PM notifications are sent through cpu_pm_enter(), as otherwise it's likely that some notifications would fail. Therefore, let's re-order the sequence in psci_enter_domain_idle_state(), so cpu_pm_enter() gets called prior pm_runtime_put_sync(). Fixes: ce85aef570df ("cpuidle: psci: Manage runtime PM in the idle path") Reported-by: Lina Iyer Signed-off-by: Ulf Hansson Acked-by: Sudeep Holla Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/cpuidle-psci.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c index bae9140a65a5..d0fb585073c6 100644 --- a/drivers/cpuidle/cpuidle-psci.c +++ b/drivers/cpuidle/cpuidle-psci.c @@ -58,6 +58,10 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev, u32 state; int ret; + ret = cpu_pm_enter(); + if (ret) + return -1; + /* Do runtime PM to manage a hierarchical CPU toplogy. */ pm_runtime_put_sync_suspend(pd_dev); @@ -65,10 +69,12 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev, if (!state) state = states[idx]; - ret = psci_enter_state(idx, state); + ret = psci_cpu_suspend_enter(state) ? -1 : idx; pm_runtime_get_sync(pd_dev); + cpu_pm_exit(); + /* Clear the domain state to start fresh when back from idle. */ psci_set_domain_state(0); return ret; From 552abb884e97d26589964e5a8c7e736f852f95f0 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 18 May 2020 12:49:45 +0200 Subject: [PATCH 26/51] cpufreq: Fix up cpufreq_boost_set_sw() After commit 18c49926c4bf ("cpufreq: Add QoS requests for userspace constraints") the return value of freq_qos_update_request(), that can be 1, passed by cpufreq_boost_set_sw() to its caller sometimes confuses the latter, which only expects to see 0 or negative error codes, so notice that cpufreq_boost_set_sw() can return an error code (which should not be -EINVAL for that matter) as soon as the first policy without a frequency table is found (because either all policies have a frequency table or none of them have it) and rework it to meet its caller's expectations. 
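For reference, this software boost path is the one exercised when boost is toggled from user space, so the effect of the rework can be checked from the shell (an illustrative sketch, assuming a platform whose cpufreq driver exposes the global boost knob):

    # Read the current boost state (1 = enabled, 0 = disabled)
    cat /sys/devices/system/cpu/cpufreq/boost

    # Toggle boost; after this rework a positive "value changed" return from
    # freq_qos_update_request() counts as success instead of surfacing as a
    # write error
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
    echo 1 > /sys/devices/system/cpu/cpufreq/boost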
Fixes: 18c49926c4bf ("cpufreq: Add QoS requests for userspace constraints") Reported-by: Serge Semin Reported-by: Xiongfeng Wang Acked-by: Viresh Kumar Cc: 5.3+ # 5.3+ Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/cpufreq.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 045f9fe157ce..d03f250f68e4 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2535,26 +2535,27 @@ EXPORT_SYMBOL_GPL(cpufreq_update_limits); static int cpufreq_boost_set_sw(int state) { struct cpufreq_policy *policy; - int ret = -EINVAL; for_each_active_policy(policy) { + int ret; + if (!policy->freq_table) - continue; + return -ENXIO; ret = cpufreq_frequency_table_cpuinfo(policy, policy->freq_table); if (ret) { pr_err("%s: Policy frequency update failed\n", __func__); - break; + return ret; } ret = freq_qos_update_request(policy->max_freq_req, policy->max); if (ret < 0) - break; + return ret; } - return ret; + return 0; } int cpufreq_boost_trigger_state(int state) From 3f9f8daad3422809d1db47ef1ca5b1400c889f9d Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:20 +0800 Subject: [PATCH 27/51] cpuidle: sysfs: Fix the overlap for showing available governors When showing the available governors, it's "%s " in scnprintf(), not "%s", so if the governor name has 15 characters, it will overlap with the later one, fix it by adding one more for the size. While we are at it, fix the minor coding style issue and remove the "/sizeof(char)" since sizeof(char) always equals 1. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Tested-by: Doug Smythies Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index d3ef1d7ad6ee..477b05afaf81 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -35,10 +35,10 @@ static ssize_t show_available_governors(struct device *dev, mutex_lock(&cpuidle_lock); list_for_each_entry(tmp, &cpuidle_governors, governor_list) { - if (i >= (ssize_t) ((PAGE_SIZE/sizeof(char)) - - CPUIDLE_NAME_LEN - 2)) + if (i >= (ssize_t) (PAGE_SIZE - (CPUIDLE_NAME_LEN + 2))) goto out; - i += scnprintf(&buf[i], CPUIDLE_NAME_LEN, "%s ", tmp->name); + + i += scnprintf(&buf[i], CPUIDLE_NAME_LEN + 1, "%s ", tmp->name); } out: From ef7e7d65eb808b5d37b4596974526962a741e930 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:21 +0800 Subject: [PATCH 28/51] cpuidle: sysfs: Accept governor name with 15 characters CPUIDLE_NAME_LEN is 16, so it's possible to accept governor name with 15 characters, but now store_current_governor() rejects governor name with 15 characters as it returns -EINVAL if count equals CPUIDLE_NAME_LEN. Refactor the code to accept such case and simplify the code. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Tested-by: Doug Smythies Signed-off-by: Rafael J. 
Wysocki --- drivers/cpuidle/sysfs.c | 23 +++++++---------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index 477b05afaf81..a57ad10baccc 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -85,34 +85,25 @@ static ssize_t store_current_governor(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - char gov_name[CPUIDLE_NAME_LEN]; - int ret = -EINVAL; - size_t len = count; + char gov_name[CPUIDLE_NAME_LEN + 1]; + int ret; struct cpuidle_governor *gov; - if (!len || len >= sizeof(gov_name)) + ret = sscanf(buf, "%" __stringify(CPUIDLE_NAME_LEN) "s", gov_name); + if (ret != 1) return -EINVAL; - memcpy(gov_name, buf, len); - gov_name[len] = '\0'; - if (gov_name[len - 1] == '\n') - gov_name[--len] = '\0'; - mutex_lock(&cpuidle_lock); - + ret = -EINVAL; list_for_each_entry(gov, &cpuidle_governors, governor_list) { - if (strlen(gov->name) == len && !strcmp(gov->name, gov_name)) { + if (!strncmp(gov->name, gov_name, CPUIDLE_NAME_LEN)) { ret = cpuidle_switch_governor(gov); break; } } - mutex_unlock(&cpuidle_lock); - if (ret) - return ret; - else - return count; + return ret ? ret : count; } static DEVICE_ATTR(current_driver, 0444, show_current_driver, NULL); From b52e93e4e86c600492f977badad3c9e0f0303cb2 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:22 +0800 Subject: [PATCH 29/51] cpuidle: Make cpuidle governor switchable to be the default behaviour For now cpuidle governor can be switched via sysfs only when the boot option "cpuidle_sysfs_switch" is passed, but it's important to switch the governor to adapt to different workloads, especially after TEO and haltpoll governor were introduced. Add available_governors and current_governor into the default attributes, but reserve the current_governor_ro for compatiblity. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Tested-by: Doug Smythies Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index a57ad10baccc..b51c470d3bdb 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -106,19 +106,20 @@ static ssize_t store_current_governor(struct device *dev, return ret ? ret : count; } +static DEVICE_ATTR(available_governors, 0444, show_available_governors, NULL); static DEVICE_ATTR(current_driver, 0444, show_current_driver, NULL); +static DEVICE_ATTR(current_governor, 0644, show_current_governor, + store_current_governor); static DEVICE_ATTR(current_governor_ro, 0444, show_current_governor, NULL); static struct attribute *cpuidle_default_attrs[] = { + &dev_attr_available_governors.attr, &dev_attr_current_driver.attr, + &dev_attr_current_governor.attr, &dev_attr_current_governor_ro.attr, NULL }; -static DEVICE_ATTR(available_governors, 0444, show_available_governors, NULL); -static DEVICE_ATTR(current_governor, 0644, show_current_governor, - store_current_governor); - static struct attribute *cpuidle_switch_attrs[] = { &dev_attr_available_governors.attr, &dev_attr_current_driver.attr, From cce55cc902baa3e6b6bab5f72f3ce826cb8dc9a9 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:23 +0800 Subject: [PATCH 30/51] cpuidle: sysfs: Remove sysfs_switch and switch attributes Since the cpuidle governor can be switched via sysfs in default, remove sysfs_switch and cpuidle_switch_attrs. 
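With the switch-specific attributes removed, governor selection relies solely on the regular sysfs interface, for example (a usage sketch; the governors actually listed depend on the kernel configuration):

    # List the governors built into the running kernel
    cat /sys/devices/system/cpu/cpuidle/available_governors

    # Show the governor currently in use
    cat /sys/devices/system/cpu/cpuidle/current_governor

    # Switch to one of the listed governors at run time, e.g. "menu"
    echo menu > /sys/devices/system/cpu/cpuidle/current_governor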
Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Tested-by: Doug Smythies Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 22 ++-------------------- 1 file changed, 2 insertions(+), 20 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index b51c470d3bdb..14c0eb536787 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -18,14 +18,6 @@ #include "cpuidle.h" -static unsigned int sysfs_switch; -static int __init cpuidle_sysfs_setup(char *unused) -{ - sysfs_switch = 1; - return 1; -} -__setup("cpuidle_sysfs_switch", cpuidle_sysfs_setup); - static ssize_t show_available_governors(struct device *dev, struct device_attribute *attr, char *buf) @@ -112,7 +104,7 @@ static DEVICE_ATTR(current_governor, 0644, show_current_governor, store_current_governor); static DEVICE_ATTR(current_governor_ro, 0444, show_current_governor, NULL); -static struct attribute *cpuidle_default_attrs[] = { +static struct attribute *cpuidle_attrs[] = { &dev_attr_available_governors.attr, &dev_attr_current_driver.attr, &dev_attr_current_governor.attr, @@ -120,15 +112,8 @@ static struct attribute *cpuidle_default_attrs[] = { NULL }; -static struct attribute *cpuidle_switch_attrs[] = { - &dev_attr_available_governors.attr, - &dev_attr_current_driver.attr, - &dev_attr_current_governor.attr, - NULL -}; - static struct attribute_group cpuidle_attr_group = { - .attrs = cpuidle_default_attrs, + .attrs = cpuidle_attrs, .name = "cpuidle", }; @@ -138,9 +123,6 @@ static struct attribute_group cpuidle_attr_group = { */ int cpuidle_add_interface(struct device *dev) { - if (sysfs_switch) - cpuidle_attr_group.attrs = cpuidle_switch_attrs; - return sysfs_create_group(&dev->kobj, &cpuidle_attr_group); } From 7395683a2498c7000120cdee8e4fb0c632e5561b Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:24 +0800 Subject: [PATCH 31/51] Documentation: cpuidle: update the document Update the document after the remove of cpuidle_sysfs_switch. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Signed-off-by: Rafael J. Wysocki --- .../ABI/testing/sysfs-devices-system-cpu | 24 +++++++------------ Documentation/admin-guide/pm/cpuidle.rst | 20 +++++++--------- Documentation/driver-api/pm/cpuidle.rst | 5 ++-- 3 files changed, 20 insertions(+), 29 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2e0e3b45d02a..6b5dafab950c 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -106,10 +106,10 @@ Description: CPU topology files that describe a logical CPU's relationship See Documentation/admin-guide/cputopology.rst for more information. -What: /sys/devices/system/cpu/cpuidle/current_driver - /sys/devices/system/cpu/cpuidle/current_governer_ro - /sys/devices/system/cpu/cpuidle/available_governors +What: /sys/devices/system/cpu/cpuidle/available_governors + /sys/devices/system/cpu/cpuidle/current_driver /sys/devices/system/cpu/cpuidle/current_governor + /sys/devices/system/cpu/cpuidle/current_governer_ro Date: September 2007 Contact: Linux kernel mailing list Description: Discover cpuidle policy and mechanism @@ -119,24 +119,18 @@ Description: Discover cpuidle policy and mechanism consumption during idle. 
Idle policy (governor) is differentiated from idle mechanism - (driver) - - current_driver: (RO) displays current idle mechanism - - current_governor_ro: (RO) displays current idle policy - - With the cpuidle_sysfs_switch boot option enabled (meant for - developer testing), the following three attributes are visible - instead: - - current_driver: same as described above + (driver). available_governors: (RO) displays a space separated list of - available governors + available governors. + + current_driver: (RO) displays current idle mechanism. current_governor: (RW) displays current idle policy. Users can switch the governor at runtime by writing to this file. + current_governor_ro: (RO) displays current idle policy. + See Documentation/admin-guide/pm/cpuidle.rst and Documentation/driver-api/pm/cpuidle.rst for more information. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 5605cc6f9560..a96a423e3779 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -159,17 +159,15 @@ governor uses that information depends on what algorithm is implemented by it and that is the primary reason for having more than one governor in the ``CPUIdle`` subsystem. -There are three ``CPUIdle`` governors available, ``menu``, `TEO `_ -and ``ladder``. Which of them is used by default depends on the configuration -of the kernel and in particular on whether or not the scheduler tick can be -`stopped by the idle loop `_. It is possible to change the -governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has -been passed to the kernel, but that is not safe in general, so it should not be -done on production systems (that may change in the future, though). The name of -the ``CPUIdle`` governor currently used by the kernel can be read from the -:file:`current_governor_ro` (or :file:`current_governor` if -``cpuidle_sysfs_switch`` is present in the kernel command line) file under -:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``. +There are four ``CPUIdle`` governors available, ``menu``, `TEO `_, +``ladder`` and ``haltpoll``. Which of them is used by default depends on the +configuration of the kernel and in particular on whether or not the scheduler +tick can be `stopped by the idle loop `_. Available +governors can be read from the :file:`available_governors`, and the governor +can be changed at runtime. The name of the ``CPUIdle`` governor currently +used by the kernel can be read from the :file:`current_governor_ro` or +:file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/` +in ``sysfs``. Which ``CPUIdle`` driver is used, on the other hand, usually depends on the platform the kernel is running on, but there are platforms with more than one diff --git a/Documentation/driver-api/pm/cpuidle.rst b/Documentation/driver-api/pm/cpuidle.rst index 006cf6db40c6..3588bf078566 100644 --- a/Documentation/driver-api/pm/cpuidle.rst +++ b/Documentation/driver-api/pm/cpuidle.rst @@ -68,9 +68,8 @@ only one in the list (that is, the list was empty before) or the value of its governor currently in use, or the name of the new governor was passed to the kernel as the value of the ``cpuidle.governor=`` command line parameter, the new governor will be used from that point on (there can be only one ``CPUIdle`` -governor in use at a time). 
Also, if ``cpuidle_sysfs_switch`` is passed to the -kernel in the command line, user space can choose the ``CPUIdle`` governor to -use at run time via ``sysfs``. +governor in use at a time). Also, user space can choose the ``CPUIdle`` +governor to use at run time via ``sysfs``. Once registered, ``CPUIdle`` governors cannot be unregistered, so it is not practical to put them into loadable kernel modules. From a0bd8a2780fab2c8008e128e8a55995d8923e638 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:25 +0800 Subject: [PATCH 32/51] Documentation: ABI: make current_governor_ro a candidate for removal Since both current_governor and current_governor_ro co-exist under /sys/devices/system/cpu/cpuidle/ and are duplicates, make current_governor_ro a candidate for removal. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- Documentation/ABI/obsolete/sysfs-cpuidle | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 Documentation/ABI/obsolete/sysfs-cpuidle diff --git a/Documentation/ABI/obsolete/sysfs-cpuidle b/Documentation/ABI/obsolete/sysfs-cpuidle new file mode 100644 index 000000000000..e398fb5e542f --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-cpuidle @@ -0,0 +1,9 @@ +What: /sys/devices/system/cpu/cpuidle/current_governor_ro +Date: April, 2020 +Contact: linux-pm@vger.kernel.org +Description: + current_governor_ro shows the cpuidle governor currently in use, but is read only. + Now that the cpuidle governor can be changed at runtime by default, + both current_governor and current_governor_ro co-exist under + /sys/devices/system/cpu/cpuidle/ and are duplicates, so + current_governor_ro is obsolete. From ab7e9b067f3d9cbec28cfca51d341efb421b7a51 Mon Sep 17 00:00:00 2001 From: Domenico Andreoli Date: Thu, 7 May 2020 09:19:52 +0200 Subject: [PATCH 33/51] PM: hibernate: Incorporate concurrency handling Hibernation concurrency handling is currently delegated to user.c, where it is also used for regulating access to the snapshot device. In preparation for making user.c a separate configuration option, that mutual exclusion is brought into hibernate.c and made available through newly introduced accessor helpers. Signed-off-by: Domenico Andreoli Signed-off-by: Rafael J.
Wysocki --- kernel/power/hibernate.c | 20 ++++++++++++++++---- kernel/power/power.h | 4 ++-- kernel/power/user.c | 10 ++++------ 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 30bd28d1d418..02ec716a4927 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -67,6 +67,18 @@ bool freezer_test_done; static const struct platform_hibernation_ops *hibernation_ops; +static atomic_t hibernate_atomic = ATOMIC_INIT(1); + +bool hibernate_acquire(void) +{ + return atomic_add_unless(&hibernate_atomic, -1, 0); +} + +void hibernate_release(void) +{ + atomic_inc(&hibernate_atomic); +} + bool hibernation_available(void) { return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION); @@ -704,7 +716,7 @@ int hibernate(void) lock_system_sleep(); /* The snapshot device should not be opened while we're running */ - if (!atomic_add_unless(&snapshot_device_available, -1, 0)) { + if (!hibernate_acquire()) { error = -EBUSY; goto Unlock; } @@ -775,7 +787,7 @@ int hibernate(void) Exit: __pm_notifier_call_chain(PM_POST_HIBERNATION, nr_calls, NULL); pm_restore_console(); - atomic_inc(&snapshot_device_available); + hibernate_release(); Unlock: unlock_system_sleep(); pr_info("hibernation exit\n"); @@ -880,7 +892,7 @@ static int software_resume(void) goto Unlock; /* The snapshot device should not be opened while we're running */ - if (!atomic_add_unless(&snapshot_device_available, -1, 0)) { + if (!hibernate_acquire()) { error = -EBUSY; swsusp_close(FMODE_READ); goto Unlock; @@ -911,7 +923,7 @@ static int software_resume(void) __pm_notifier_call_chain(PM_POST_RESTORE, nr_calls, NULL); pm_restore_console(); pr_info("resume failed (%d)\n", error); - atomic_inc(&snapshot_device_available); + hibernate_release(); /* For success case, the suspend path will release the lock */ Unlock: mutex_unlock(&system_transition_mutex); diff --git a/kernel/power/power.h b/kernel/power/power.h index 7cdc64dc2373..ba2094db6294 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -154,8 +154,8 @@ extern int snapshot_write_next(struct snapshot_handle *handle); extern void snapshot_write_finalize(struct snapshot_handle *handle); extern int snapshot_image_loaded(struct snapshot_handle *handle); -/* If unset, the snapshot device cannot be open. 
*/ -extern atomic_t snapshot_device_available; +extern bool hibernate_acquire(void); +extern void hibernate_release(void); extern sector_t alloc_swapdev_block(int swap); extern void free_all_swap_pages(int swap); diff --git a/kernel/power/user.c b/kernel/power/user.c index 7959449765d9..98548d1cf8a6 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -37,8 +37,6 @@ static struct snapshot_data { bool free_bitmaps; } snapshot_state; -atomic_t snapshot_device_available = ATOMIC_INIT(1); - static int snapshot_open(struct inode *inode, struct file *filp) { struct snapshot_data *data; @@ -49,13 +47,13 @@ static int snapshot_open(struct inode *inode, struct file *filp) lock_system_sleep(); - if (!atomic_add_unless(&snapshot_device_available, -1, 0)) { + if (!hibernate_acquire()) { error = -EBUSY; goto Unlock; } if ((filp->f_flags & O_ACCMODE) == O_RDWR) { - atomic_inc(&snapshot_device_available); + hibernate_release(); error = -ENOSYS; goto Unlock; } @@ -92,7 +90,7 @@ static int snapshot_open(struct inode *inode, struct file *filp) __pm_notifier_call_chain(PM_POST_RESTORE, nr_calls, NULL); } if (error) - atomic_inc(&snapshot_device_available); + hibernate_release(); data->frozen = false; data->ready = false; @@ -122,7 +120,7 @@ static int snapshot_release(struct inode *inode, struct file *filp) } pm_notifier_call_chain(data->mode == O_RDONLY ? PM_POST_HIBERNATION : PM_POST_RESTORE); - atomic_inc(&snapshot_device_available); + hibernate_release(); unlock_system_sleep(); From c4f39a6c74389fcc93ac39056ef342f32ab57a23 Mon Sep 17 00:00:00 2001 From: Domenico Andreoli Date: Thu, 7 May 2020 09:19:53 +0200 Subject: [PATCH 34/51] PM: hibernate: Split off snapshot dev option Make it possible to reduce the attack surface in case the snapshot device is not to be used from userspace. Signed-off-by: Domenico Andreoli Signed-off-by: Rafael J. Wysocki --- kernel/power/Kconfig | 12 ++++++++++++ kernel/power/Makefile | 3 ++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index c208566c844b..4d0e6e815a2b 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -80,6 +80,18 @@ config HIBERNATION For more information take a look at . +config HIBERNATION_SNAPSHOT_DEV + bool "Userspace snapshot device" + depends on HIBERNATION + default y + ---help--- + Device used by the uswsusp tools. + + Say N if no snapshotting from userspace is needed, this also + reduces the attack surface of the kernel. + + If in doubt, say Y. + config PM_STD_PARTITION string "Default resume partition" depends on HIBERNATION diff --git a/kernel/power/Makefile b/kernel/power/Makefile index e7e47d9be1e5..5899260a8bef 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -10,7 +10,8 @@ obj-$(CONFIG_VT_CONSOLE_SLEEP) += console.o obj-$(CONFIG_FREEZER) += process.o obj-$(CONFIG_SUSPEND) += suspend.o obj-$(CONFIG_PM_TEST_SUSPEND) += suspend_test.o -obj-$(CONFIG_HIBERNATION) += hibernate.o snapshot.o swap.o user.o +obj-$(CONFIG_HIBERNATION) += hibernate.o snapshot.o swap.o +obj-$(CONFIG_HIBERNATION_SNAPSHOT_DEV) += user.o obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o From 213081dadd308d5d41a5110c2dc87fd8ba42de5e Mon Sep 17 00:00:00 2001 From: Srinivas Pandruvada Date: Mon, 11 May 2020 16:57:31 -0700 Subject: [PATCH 35/51] Documentation: admin-guide: pm: Document intel-speed-select Added documentation to configure servers to use Intel(R) Speed Select Technology using intel-speed-select tool. 
Signed-off-by: Srinivas Pandruvada Acked-by: Andriy Shevchenko Signed-off-by: Rafael J. Wysocki --- .../admin-guide/pm/intel-speed-select.rst | 917 ++++++++++++++++++ .../admin-guide/pm/working-state.rst | 1 + 2 files changed, 918 insertions(+) create mode 100644 Documentation/admin-guide/pm/intel-speed-select.rst diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst new file mode 100644 index 000000000000..b2ca601c21c6 --- /dev/null +++ b/Documentation/admin-guide/pm/intel-speed-select.rst @@ -0,0 +1,917 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Intel(R) Speed Select Technology User Guide +============================================================ + +The Intel(R) Speed Select Technology (Intel(R) SST) provides a powerful new +collection of features that give more granular control over CPU performance. +With Intel(R) SST, one server can be configured for power and performance for a +variety of diverse workload requirements. + +Refer to the links below for an overview of the technology: + +- https://www.intel.com/content/www/us/en/architecture-and-technology/speed-select-technology-article.html +- https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf + +These capabilities are further enhanced in some of the newer generations of +server platforms where these features can be enumerated and controlled +dynamically without pre-configuring via BIOS setup options. This dynamic +configuration is done via mailbox commands to the hardware. One way to enumerate +and configure these features is by using the Intel Speed Select utility. + +This document explains how to use the Intel Speed Select tool to enumerate and +control Intel(R) SST features. This document gives example commands and explains +how these commands change the power and performance profile of the system under +test. Using this tool as an example, customers can replicate the messaging +implemented in the tool in their production software. + +intel-speed-select configuration tool +====================================== + +Most Linux distribution packages may include the "intel-speed-select" tool. If not, +it can be built by downloading the Linux kernel tree from kernel.org. Once +downloaded, the tool can be built without building the full kernel. + +From the kernel tree, run the following commands:: + +# cd tools/power/x86/intel-speed-select/ +# make +# make install + +Getting Help +------------ + +To get help with the tool, execute the command below:: + +# intel-speed-select --help + +The top-level help describes arguments and features. Notice that there is a +multi-level help structure in the tool. For example, to get help for the feature "perf-profile":: + +# intel-speed-select perf-profile --help + +To get help on a command, another level of help is provided. 
For example for the command info "info":: + +# intel-speed-select perf-profile info --help + +Summary of platform capability +------------------------------ +To check the current platform and driver capaibilities, execute:: + +#intel-speed-select --info + +For example on a test system:: + + # intel-speed-select --info + Intel(R) Speed Select Technology + Executing on CPU model: X + Platform: API version : 1 + Platform: Driver version : 1 + Platform: mbox supported : 1 + Platform: mmio supported : 1 + Intel(R) SST-PP (feature perf-profile) is supported + TDP level change control is unlocked, max level: 4 + Intel(R) SST-TF (feature turbo-freq) is supported + Intel(R) SST-BF (feature base-freq) is not supported + Intel(R) SST-CP (feature core-power) is supported + +Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) +------------------------------------------------------------------------ + +This feature allows configuration of a server dynamically based on workload +performance requirements. This helps users during deployment as they do not have +to choose a specific server configuration statically. This Intel(R) Speed Select +Technology - Performance Profile (Intel(R) SST-PP) feature introduces a mechanism +that allows multiple optimized performance profiles per system. Each profile +defines a set of CPUs that need to be online and rest offline to sustain a +guaranteed base frequency. Once the user issues a command to use a specific +performance profile and meet CPU online/offline requirement, the user can expect +a change in the base frequency dynamically. This feature is called +"perf-profile" when using the Intel Speed Select tool. + +Number or performance levels +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There can be multiple performance profiles on a system. To get the number of +profiles, execute the command below:: + + # intel-speed-select perf-profile get-config-levels + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-config-levels:4 + package-1 + die-0 + cpu-14 + get-config-levels:4 + +On this system under test, there are 4 performance profiles in addition to the +base performance profile (which is performance level 0). + +Lock/Unlock status +~~~~~~~~~~~~~~~~~~ + +Even if there are multiple performance profiles, it is possible that that they +are locked. If they are locked, users cannot issue a command to change the +performance state. It is possible that there is a BIOS setup to unlock or check +with your system vendor. + +To check if the system is locked, execute the following command:: + + # intel-speed-select perf-profile get-lock-status + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-lock-status:0 + package-1 + die-0 + cpu-14 + get-lock-status:0 + +In this case, lock status is 0, which means that the system is unlocked. + +Properties of a performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get properties of a specific performance level (For example for the level 0, below), execute the command below:: + + # intel-speed-select perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + cpu-count:28 + enable-cpu-mask:000003ff,f0003fff + enable-cpu-list:0,1,2,3,4,5,6,7,8,9,10,11,12,13,28,29,30,31,32,33,34,35,36,37,38,39,40,41 + thermal-design-power-ratio:26 + base-frequency(MHz):2600 + speed-select-turbo-freq:disabled + speed-select-base-freq:disabled + ... + ... 
+ +Here -l option is used to specify a performance level. + +If the option -l is omitted, then this command will print information about all +the performance levels. The above command is printing properties of the +performance level 0. + +For this performance profile, the list of CPUs displayed by the +"enable-cpu-mask/enable-cpu-list" at the max can be "online." When that +condition is met, then base frequency of 2600 MHz can be maintained. To +understand more, execute "intel-speed-select perf-profile info" for performance +level 4:: + + # intel-speed-select perf-profile info -l 4 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-4 + cpu-count:28 + enable-cpu-mask:000000fa,f0000faf + enable-cpu-list:0,1,2,3,5,7,8,9,10,11,28,29,30,31,33,35,36,37,38,39 + thermal-design-power-ratio:28 + base-frequency(MHz):2800 + speed-select-turbo-freq:disabled + speed-select-base-freq:unsupported + ... + ... + +There are fewer CPUs in the "enable-cpu-mask/enable-cpu-list". Consequently, if +the user only keeps these CPUs online and the rest "offline," then the base +frequency is increased to 2.8 GHz compared to 2.6 GHz at performance level 0. + +Get current performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the current performance level, execute:: + + # intel-speed-select perf-profile get-config-current-level + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-config-current_level:0 + +First verify that the base_frequency displayed by the cpufreq sysfs is correct:: + + # cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency + 2600000 + +This matches the base-frequency (MHz) field value displayed from the +"perf-profile info" command for performance level 0(cpufreq frequency is in +KHz). + +To check if the average frequency is equal to the base frequency for a 100% busy +workload, disable turbo:: + +# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo + +Then runs a busy workload on all CPUs, for example:: + +#stress -c 64 + +To verify the base frequency, run turbostat:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + + Package Core CPU Bzy_MHz + - - 2600 + 0 0 0 2600 + 0 1 1 2600 + 0 2 2 2600 + 0 3 3 2600 + 0 4 4 2600 + . . . . + + +Changing performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To the change the performance level to 4, execute:: + + # intel-speed-select -d perf-profile set-config-level -l 4 -o + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile + set_tdp_level:success + +In the command above, "-o" is optional. If it is specified, then it will also +offline CPUs which are not present in the enable_cpu_mask for this performance +level. + +Now if the base_frequency is checked:: + + #cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency + 2800000 + +Which shows that the base frequency now increased from 2600 MHz at performance +level 0 to 2800 MHz at performance level 4. As a result, any workload, which can +use fewer CPUs, can see a boost of 200 MHz compared to performance level 0. + +Check presence of other Intel(R) SST features +--------------------------------------------- + +Each of the performance profiles also specifies weather there is support of +other two Intel(R) SST features (Intel(R) Speed Select Technology - Base Frequency +(Intel(R) SST-BF) and Intel(R) Speed Select Technology - Turbo Frequency (Intel +SST-TF)). 
+ +For example, from the output of "perf-profile info" above, for level 0 and level +4: + +For level 0:: + speed-select-turbo-freq:disabled + speed-select-base-freq:disabled + +For level 4:: + speed-select-turbo-freq:disabled + speed-select-base-freq:unsupported + +Given these results, the "speed-select-base-freq" (Intel(R) SST-BF) in level 4 +changed from "disabled" to "unsupported" compared to performance level 0. + +This means that at performance level 4, the "speed-select-base-freq" feature is +not supported. However, at performance level 0, this feature is "supported", but +currently "disabled", meaning the user has not activated this feature. Whereas +"speed-select-turbo-freq" (Intel(R) SST-TF) is supported at both performance +levels, but currently not activated by the user. + +The Intel(R) SST-BF and the Intel(R) SST-TF features are built on a foundation +technology called Intel(R) Speed Select Technology - Core Power (Intel(R) SST-CP). +The platform firmware enables this feature when Intel(R) SST-BF or Intel(R) SST-TF +is supported on a platform. + +Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) +--------------------------------------------------------------- + +Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) is an interface that +allows users to define per core priority. This defines a mechanism to distribute +power among cores when there is a power constrained scenario. This defines a +class of service (CLOS) configuration. + +The user can configure up to 4 class of service configurations. Each CLOS group +configuration allows definitions of parameters, which affects how the frequency +can be limited and power is distributed. Each CPU core can be tied to a class of +service and hence an associated priority. The granularity is at core level not +at per CPU level. + +Enable CLOS based prioritization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To use CLOS based prioritization feature, firmware must be informed to enable +and use a priority type. There is a default per platform priority type, which +can be changed with optional command line parameter. + +To enable and check the options, execute:: + + # intel-speed-select core-power enable --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Enable core-power for a package/die + Clos Enable: Specify priority type with [--priority|-p] + 0: Proportional, 1: Ordered + +There are two types of priority types: + +- Ordered + +Priority for ordered throttling is defined based on the index of the assigned +CLOS group. Where CLOS0 gets highest priority (throttled last). + +Priority order is: +CLOS0 > CLOS1 > CLOS2 > CLOS3. + +- Proportional + +When proportional priority is used, there is an additional parameter called +frequency_weight, which can be specified per CLOS group. The goal of +proportional priority is to provide each core with the requested min., then +distribute all remaining (excess/deficit) budgets in proportion to a defined +weight. This proportional priority can be configured using "core-power config" +command. + +To enable with the platform default priority type, execute:: + + # intel-speed-select core-power enable + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + enable:success + package-1 + die-0 + cpu-6 + core-power + enable:success + +The scope of this enable is per package or die scoped when a package contains +multiple dies. To check if CLOS is enabled and get priority type, "core-power +info" command can be used. 
For example to check the status of core-power feature +on CPU 0, execute:: + + # intel-speed-select -c 0 core-power info + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + support-status:supported + enable-status:enabled + clos-enable-status:enabled + priority-type:proportional + package-1 + die-0 + cpu-24 + core-power + support-status:supported + enable-status:enabled + clos-enable-status:enabled + priority-type:proportional + +Configuring CLOS groups +~~~~~~~~~~~~~~~~~~~~~~~ + +Each CLOS group has its own attributes including min, max, freq_weight and +desired. These parameters can be configured with "core-power config" command. +Defaults will be used if user skips setting a parameter except clos id, which is +mandatory. To check core-power config options, execute:: + + # intel-speed-select core-power config --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Set core-power configuration for one of the four clos ids + Specify targeted clos id with [--clos|-c] + Specify clos Proportional Priority [--weight|-w] + Specify clos min in MHz with [--min|-n] + Specify clos max in MHz with [--max|-m] + +For example:: + + # intel-speed-select core-power config -c 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + clos epp is not specified, default: 0 + clos frequency weight is not specified, default: 0 + clos min is not specified, default: 0 MHz + clos max is not specified, default: 25500 MHz + clos desired is not specified, default: 0 + package-0 + die-0 + cpu-0 + core-power + config:success + package-1 + die-0 + cpu-6 + core-power + config:success + +The user has the option to change defaults. For example, the user can change the +"min" and set the base frequency to always get guaranteed base frequency. + +Get the current CLOS configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To check the current configuration, "core-power get-config" can be used. For +example, to get the configuration of CLOS 0:: + + # intel-speed-select core-power get-config -c 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + clos:0 + epp:0 + clos-proportional-priority:0 + clos-min:0 MHz + clos-max:Max Turbo frequency + clos-desired:0 MHz + package-1 + die-0 + cpu-24 + core-power + clos:0 + epp:0 + clos-proportional-priority:0 + clos-min:0 MHz + clos-max:Max Turbo frequency + clos-desired:0 MHz + +Associating a CPU with a CLOS group +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To associate a CPU to a CLOS group "core-power assoc" command can be used:: + + # intel-speed-select core-power assoc --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Associate a clos id to a CPU + Specify targeted clos id with [--clos|-c] + + +For example to associate CPU 10 to CLOS group 3, execute:: + + # intel-speed-select -c 10 core-power assoc -c 3 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-10 + core-power + assoc:success + +Once a CPU is associated, its sibling CPUs are also associated to a CLOS group. +Once associated, avoid changing Linux "cpufreq" subsystem scaling frequency +limits. + +To check the existing association for a CPU, "core-power get-assoc" command can +be used. 
For example, to get association of CPU 10, execute:: + + # intel-speed-select -c 10 core-power get-assoc + Intel(R) Speed Select Technology + Executing on CPU model: X + package-1 + die-0 + cpu-10 + get-assoc + clos:3 + +This shows that CPU 10 is part of a CLOS group 3. + + +Disable CLOS based prioritization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To disable, execute:: + +# intel-speed-select core-power disable + +Some features like Intel(R) SST-TF can only be enabled when CLOS based prioritization +is enabled. For this reason, disabling while Intel(R) SST-TF is enabled can cause +Intel(R) SST-TF to fail. This will cause the "disable" command to display an error +if Intel(R) SST-TF is already enabled. In turn, to disable, the Intel(R) SST-TF +feature must be disabled first. + +Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) +------------------------------------------------------------------- + +The Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) feature lets +the user control base frequency. If some critical workload threads demand +constant high guaranteed performance, then this feature can be used to execute +the thread at higher base frequency on specific sets of CPUs (high priority +CPUs) at the cost of lower base frequency (low priority CPUs) on other CPUs. +This feature does not require offline of the low priority CPUs. + +The support of Intel(R) SST-BF depends on the Intel(R) Speed Select Technology - +Performance Profile (Intel(R) SST-PP) performance level configuration. It is +possible that only certain performance levels support Intel(R) SST-BF. It is also +possible that only base performance level (level = 0) has support of Intel +SST-BF. Consequently, first select the desired performance level to enable this +feature. + +In the system under test here, Intel(R) SST-BF is supported at the base +performance level 0, but currently disabled. For example for the level 0:: + + # intel-speed-select -c 0 perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + + speed-select-base-freq:disabled + ... + +Before enabling Intel(R) SST-BF and measuring its impact on a workload +performance, execute some workload and measure performance and get a baseline +performance to compare against. + +Here the user wants more guaranteed performance. For this reason, it is likely +that turbo is disabled. To disable turbo, execute:: + +#echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo + +Based on the output of the "intel-speed-select perf-profile info -l 0" base +frequency of guaranteed frequency 2600 MHz. + + +Measure baseline performance for comparison +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To compare, pick a multi-threaded workload where each thread can be scheduled on +separate CPUs. "Hackbench pipe" test is a good example on how to improve +performance using Intel(R) SST-BF. + +Below, the workload is measuring average scheduler wakeup latency, so a lower +number means better performance:: + + # taskset -c 3,4 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 6.102 [sec] + 6.102445 usecs/op + 163868 ops/sec + +While running the above test, if we take turbostat output, it will show us that +2 of the CPUs are busy and reaching max. frequency (which would be the base +frequency as the turbo is disabled). 
The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 1000 + 0 1 1 1005 + 0 2 2 1000 + 0 3 3 2600 + 0 4 4 2600 + 0 5 5 1000 + 0 6 6 1000 + 0 7 7 1005 + 0 8 8 1005 + 0 9 9 1000 + 0 10 10 1000 + 0 11 11 995 + 0 12 12 1000 + 0 13 13 1000 + +From the above turbostat output, both CPU 3 and 4 are very busy and reaching +full guaranteed frequency of 2600 MHz. + +Intel(R) SST-BF Capabilities +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get capabilities of Intel(R) SST-BF for the current performance level 0, +execute:: + + # intel-speed-select base-freq info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + speed-select-base-freq + high-priority-base-frequency(MHz):3000 + high-priority-cpu-mask:00000216,00002160 + high-priority-cpu-list:5,6,8,13,33,34,36,41 + low-priority-base-frequency(MHz):2400 + tjunction-temperature(C):125 + thermal-design-power(W):205 + +The above capabilities show that there are some CPUs on this system that can +offer base frequency of 3000 MHz compared to the standard base frequency at this +performance levels. Nevertheless, these CPUs are fixed, and they are presented +via high-priority-cpu-list/high-priority-cpu-mask. But if this Intel(R) SST-BF +feature is selected, the low priorities CPUs (which are not in +high-priority-cpu-list) can only offer up to 2400 MHz. As a result, if this +clipping of low priority CPUs is acceptable, then the user can enable Intel +SST-BF feature particularly for the above "sched pipe" workload since only two +CPUs are used, they can be scheduled on high priority CPUs and can get boost of +400 MHz. + +Enable Intel(R) SST-BF +~~~~~~~~~~~~~~~~~~~~~~ + +To enable Intel(R) SST-BF feature, execute:: + + # intel-speed-select base-freq enable -a + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + base-freq + enable:success + package-1 + die-0 + cpu-14 + base-freq + enable:success + +In this case, -a option is optional. This not only enables Intel(R) SST-BF, but it +also adjusts the priority of cores using Intel(R) Speed Select Technology Core +Power (Intel(R) SST-CP) features. This option sets the minimum performance of each +Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) class to +maximum performance so that the hardware will give maximum performance possible +for each CPU. + +If -a option is not used, then the following steps are required before enabling +Intel(R) SST-BF: + +- Discover Intel(R) SST-BF and note low and high priority base frequency +- Note the high prioity CPU list +- Enable CLOS using core-power feature set +- Configure CLOS parameters. Use CLOS.min to set to minimum performance +- Subscribe desired CPUs to CLOS groups + +With this configuration, if the same workload is executed by pinning the +workload to high priority CPUs (CPU 5 and 6 in this case):: + + #taskset -c 5,6 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.627 [sec] + 5.627922 usecs/op + 177685 ops/sec + +This way, by enabling Intel(R) SST-BF, the performance of this benchmark is +improved (latency reduced) by 7.79%. From the turbostat output, it can be +observed that the high priority CPUs reached 3000 MHz compared to 2600 MHz. 
+The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 2151 + 0 1 1 2166 + 0 2 2 2175 + 0 3 3 2175 + 0 4 4 2175 + 0 5 5 3000 + 0 6 6 3000 + 0 7 7 2180 + 0 8 8 2662 + 0 9 9 2176 + 0 10 10 2175 + 0 11 11 2176 + 0 12 12 2176 + 0 13 13 2661 + +Disable Intel(R) SST-BF +~~~~~~~~~~~~~~~~~~~~~~~ + +To disable the Intel(R) SST-BF feature, execute:: + +# intel-speed-select base-freq disable -a + + +Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF) +-------------------------------------------------------------------- + +This feature enables the ability to set different "All core turbo ratio limits" +to cores based on the priority. By using this feature, some cores can be +configured to get higher turbo frequency by designating them as high priority at +the cost of lower or no turbo frequency on the low priority cores. + +For this reason, this feature is only useful when system is busy utilizing all +CPUs, but the user wants some configurable option to get high performance on +some CPUs. + +The support of Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF) +depends on the Intel(R) Speed Select Technology - Performance Profile (Intel +SST-PP) performance level configuration. It is possible that only a certain +performance level supports Intel(R) SST-TF. It is also possible that only the base +performance level (level = 0) has the support of Intel(R) SST-TF. Hence, first +select the desired performance level to enable this feature. + +In the system under test here, Intel(R) SST-TF is supported at the base +performance level 0, but currently disabled:: + + # intel-speed-select -c 0 perf-profile info -l 0 + Intel(R) Speed Select Technology + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + ... + speed-select-turbo-freq:disabled + ... + ... + + +To check if performance can be improved using Intel(R) SST-TF feature, get the turbo +frequency properties with Intel(R) SST-TF enabled and compare to the base turbo +capability of this system. + +Get Base turbo capability +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the base turbo capability of performance level 0, execute:: + + # intel-speed-select perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + ... + turbo-ratio-limits-sse + bucket-0 + core-count:2 + max-turbo-frequency(MHz):3200 + bucket-1 + core-count:4 + max-turbo-frequency(MHz):3100 + bucket-2 + core-count:6 + max-turbo-frequency(MHz):3100 + bucket-3 + core-count:8 + max-turbo-frequency(MHz):3100 + bucket-4 + core-count:10 + max-turbo-frequency(MHz):3100 + bucket-5 + core-count:12 + max-turbo-frequency(MHz):3100 + bucket-6 + core-count:14 + max-turbo-frequency(MHz):3100 + bucket-7 + core-count:16 + max-turbo-frequency(MHz):3100 + +Based on the data above, when all the CPUS are busy, the max. frequency of 3100 +MHz can be achieved. If there is some busy workload on cpu 0 - 11 (e.g. 
stress) +and on CPU 12 and 13, execute "hackbench pipe" workload:: + + # taskset -c 12,13 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.705 [sec] + 5.705488 usecs/op + 175269 ops/sec + +The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 3000 + 0 1 1 3000 + 0 2 2 3000 + 0 3 3 3000 + 0 4 4 3000 + 0 5 5 3100 + 0 6 6 3100 + 0 7 7 3000 + 0 8 8 3100 + 0 9 9 3000 + 0 10 10 3000 + 0 11 11 3000 + 0 12 12 3100 + 0 13 13 3100 + +Based on turbostat output, the performance is limited by frequency cap of 3100 +MHz. To check if the hackbench performance can be improved for CPU 12 and CPU +13, first check the capability of the Intel(R) SST-TF feature for this performance +level. + +Get Intel(R) SST-TF Capability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the capability, the "turbo-freq info" command can be used:: + + # intel-speed-select turbo-freq info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + speed-select-turbo-freq + bucket-0 + high-priority-cores-count:2 + high-priority-max-frequency(MHz):3200 + high-priority-max-avx2-frequency(MHz):3200 + high-priority-max-avx512-frequency(MHz):3100 + bucket-1 + high-priority-cores-count:4 + high-priority-max-frequency(MHz):3100 + high-priority-max-avx2-frequency(MHz):3000 + high-priority-max-avx512-frequency(MHz):2900 + bucket-2 + high-priority-cores-count:6 + high-priority-max-frequency(MHz):3100 + high-priority-max-avx2-frequency(MHz):3000 + high-priority-max-avx512-frequency(MHz):2900 + speed-select-turbo-freq-clip-frequencies + low-priority-max-frequency(MHz):2600 + low-priority-max-avx2-frequency(MHz):2400 + low-priority-max-avx512-frequency(MHz):2100 + +Based on the output above, there is an Intel(R) SST-TF bucket for which there are +two high priority cores. If only two high priority cores are set, then max. +turbo frequency on those cores can be increased to 3200 MHz. This is 100 MHz +more than the base turbo capability for all cores. + +In turn, for the hackbench workload, two CPUs can be set as high priority and +rest as low priority. One side effect is that once enabled, the low priority +cores will be clipped to a lower frequency of 2600 MHz. + +Enable Intel(R) SST-TF +~~~~~~~~~~~~~~~~~~~~~~ + +To enable Intel(R) SST-TF, execute:: + + # intel-speed-select -c 12,13 turbo-freq enable -a + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-12 + turbo-freq + enable:success + package-0 + die-0 + cpu-13 + turbo-freq + enable:success + package--1 + die-0 + cpu-63 + turbo-freq --auto + enable:success + +In this case, the option "-a" is optional. If set, it enables Intel(R) SST-TF +feature and also sets the CPUs to high and and low priority using Intel Speed +Select Technology Core Power (Intel(R) SST-CP) features. The CPU numbers passed +with "-c" arguments are marked as high priority, including its siblings. 
+ +If -a option is not used, then the following steps are required before enabling +Intel(R) SST-TF: + +- Discover Intel(R) SST-TF and note buckets of high priority cores and maximum frequency + +- Enable CLOS using core-power feature set - Configure CLOS parameters + +- Subscribe desired CPUs to CLOS groups making sure that high priority cores are set to the maximum frequency + +If the same hackbench workload is executed, schedule hackbench threads on high +priority CPUs:: + + #taskset -c 12,13 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.510 [sec] + 5.510165 usecs/op + 180826 ops/sec + +This improved performance by around 3.3% improvement on a busy system. Here the +turbostat output will show that the CPU 12 and CPU 13 are getting 100 MHz boost. +The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + ... + 0 12 12 3200 + 0 13 13 3200 diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index 0a38cdf39df1..f40994c422dc 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -13,3 +13,4 @@ Working-State Power Management intel_pstate cpufreq_drivers intel_epb + intel-speed-select From 3618bbaaa898cf48e859736120a775fcb56f3838 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 22 May 2020 18:09:55 +0300 Subject: [PATCH 36/51] PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend() rpm_suspend() simple bails out when conditions are wrong. But this is not immediately obvious from the code. Make it clear what we do when conditions are wrong in rpm_suspend(). Signed-off-by: Andy Shevchenko Signed-off-by: Rafael J. Wysocki --- drivers/base/power/runtime.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c index 99c7da112c95..9f62790f644c 100644 --- a/drivers/base/power/runtime.c +++ b/drivers/base/power/runtime.c @@ -523,13 +523,11 @@ static int rpm_suspend(struct device *dev, int rpmflags) repeat: retval = rpm_check_suspend_allowed(dev); - if (retval < 0) - ; /* Conditions are wrong. */ + goto out; /* Conditions are wrong. */ /* Synchronous suspends are not allowed in the RPM_RESUMING state. */ - else if (dev->power.runtime_status == RPM_RESUMING && - !(rpmflags & RPM_ASYNC)) + if (dev->power.runtime_status == RPM_RESUMING && !(rpmflags & RPM_ASYNC)) retval = -EAGAIN; if (retval) goto out; From 03c3b413a14d8bbf0afc8e069cf2a65df580c30c Mon Sep 17 00:00:00 2001 From: Sumeet Pawnikar Date: Thu, 21 May 2020 12:14:26 +0530 Subject: [PATCH 37/51] powercap: RAPL: remove unused local MSR define Remove unused PLATFORM_POWER_LIMIT MSR local definition from file intel_rapl_common.c. This was missed while splitting old RAPL code intel_rapl.c file into two new files intel_rapl_msr.c and intel_rapl_common.c as per the commit 3382388d7148 ("intel_rapl: abstract RAPL common code"). Currently, this #define entry is being used only in intel_rapl_msr.c file and local definition present in this file. Signed-off-by: Sumeet Pawnikar Reviewed-by: Andy Shevchenko Signed-off-by: Rafael J. 
Wysocki --- drivers/powercap/intel_rapl_common.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index c3e335e37c7d..61a63a16b5e7 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -26,9 +26,6 @@ #include #include -/* Local defines */ -#define MSR_PLATFORM_POWER_LIMIT 0x0000065C - /* bitmasks for RAPL MSRs, used by primitive access functions */ #define ENERGY_STATUS_MASK 0xffffffff From 3441362b08dc16669adc0e7d3f3454ae38619229 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 19 May 2020 13:36:48 +0200 Subject: [PATCH 38/51] ACPI: PM: s2idle: Print type of wakeup debug messages Since acpi_s2idle_wake() knows the category of wakeup causing the system to resume from suspend-to-idle, make it print a unique message for each of them to help diagnose wakeup issues. Signed-off-by: Rafael J. Wysocki --- drivers/acpi/sleep.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c index fd9d4e8318e9..31c9d0c8ae11 100644 --- a/drivers/acpi/sleep.c +++ b/drivers/acpi/sleep.c @@ -992,23 +992,31 @@ static bool acpi_s2idle_wake(void) * wakeup is pending anyway and the SCI is not the source of * it). */ - if (irqd_is_wakeup_armed(irq_get_irq_data(acpi_sci_irq))) + if (irqd_is_wakeup_armed(irq_get_irq_data(acpi_sci_irq))) { + pm_pr_dbg("Wakeup unrelated to ACPI SCI\n"); return true; + } /* * If the status bit of any enabled fixed event is set, the * wakeup is regarded as valid. */ - if (acpi_any_fixed_event_status_set()) + if (acpi_any_fixed_event_status_set()) { + pm_pr_dbg("ACPI fixed event wakeup\n"); return true; + } /* Check wakeups from drivers sharing the SCI. */ - if (acpi_check_wakeup_handlers()) + if (acpi_check_wakeup_handlers()) { + pm_pr_dbg("ACPI custom handler wakeup\n"); return true; + } /* Check non-EC GPE wakeups and dispatch the EC GPE. */ - if (acpi_ec_dispatch_gpe()) + if (acpi_ec_dispatch_gpe()) { + pm_pr_dbg("ACPI non-EC GPE wakeup\n"); return true; + } /* * Cancel the SCI wakeup and process all pending events in case @@ -1027,8 +1035,10 @@ static bool acpi_s2idle_wake(void) * are pending here, they must be resulting from the processing * of EC events above or coming from somewhere else. */ - if (pm_wakeup_pending()) + if (pm_wakeup_pending()) { + pm_pr_dbg("Wakeup after ACPI Notify sync\n"); return true; + } rearm_wake_irq(acpi_sci_irq); } From 5fcd7359019248b4de54379720847bd41bcc42aa Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 19 May 2020 14:33:10 +0200 Subject: [PATCH 39/51] ACPI: EC: PM: s2idle: Extend GPE dispatching debug message Add the "ACPI" string to the "EC GPE dispatched" message as it is ACPI-related. Signed-off-by: Rafael J. Wysocki --- drivers/acpi/ec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c index 1af2125e17d5..c44448ab19ef 100644 --- a/drivers/acpi/ec.c +++ b/drivers/acpi/ec.c @@ -2017,7 +2017,7 @@ bool acpi_ec_dispatch_gpe(void) */ ret = acpi_dispatch_gpe(NULL, first_ec->gpe); if (ret == ACPI_INTERRUPT_HANDLED) { - pm_pr_dbg("EC GPE dispatched\n"); + pm_pr_dbg("ACPI EC GPE dispatched\n"); /* Flush the event and query workqueues. 
*/ acpi_ec_flush_work(); From a871be6b8eee13a35a3e8e56c62770ef17ee9220 Mon Sep 17 00:00:00 2001 From: Stephan Gerhold Date: Thu, 16 Apr 2020 10:58:21 +0200 Subject: [PATCH 40/51] cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver The Qualcomm SPM cpuidle driver seems to be the last driver still using the generic ARM CPUidle infrastructure. Converting it actually allows us to simplify the driver, and we end up being able to remove more lines than adding new ones: - We can parse the CPUidle states in the device tree directly with dt_idle_states (and don't need to duplicate that functionality into the spm driver). - Each "saw" device managed by the SPM driver now directly registers its own cpuidle driver, removing the need for any global (per cpu) state. The device tree binding is the same, so the driver stays compatible with all old device trees. Signed-off-by: Stephan Gerhold Reviewed-by: Lina Iyer Reviewed-by: Ulf Hansson Acked-by: Bjorn Andersson Signed-off-by: Rafael J. Wysocki --- MAINTAINERS | 1 + drivers/cpuidle/Kconfig.arm | 13 ++ drivers/cpuidle/Makefile | 1 + .../qcom/spm.c => cpuidle/cpuidle-qcom-spm.c} | 138 +++++++----------- drivers/soc/qcom/Kconfig | 10 -- drivers/soc/qcom/Makefile | 1 - 6 files changed, 67 insertions(+), 97 deletions(-) rename drivers/{soc/qcom/spm.c => cpuidle/cpuidle-qcom-spm.c} (75%) diff --git a/MAINTAINERS b/MAINTAINERS index 26f281d9f32a..dfdf9c725d20 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2225,6 +2225,7 @@ F: drivers/*/qcom* F: drivers/*/qcom/ F: drivers/bluetooth/btqcomsmd.c F: drivers/clocksource/timer-qcom.c +F: drivers/cpuidle/cpuidle-qcom-spm.c F: drivers/extcon/extcon-qcom* F: drivers/i2c/busses/i2c-qcom-geni.c F: drivers/i2c/busses/i2c-qup.c diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm index 99a2d72ac02b..51a7e89085c0 100644 --- a/drivers/cpuidle/Kconfig.arm +++ b/drivers/cpuidle/Kconfig.arm @@ -94,3 +94,16 @@ config ARM_TEGRA_CPUIDLE select ARM_CPU_SUSPEND help Select this to enable cpuidle for NVIDIA Tegra20/30/114/124 SoCs. + +config ARM_QCOM_SPM_CPUIDLE + bool "CPU Idle Driver for Qualcomm Subsystem Power Manager (SPM)" + depends on (ARCH_QCOM || COMPILE_TEST) && !ARM64 + select ARM_CPU_SUSPEND + select CPU_IDLE_MULTIPLE_DRIVERS + select DT_IDLE_STATES + select QCOM_SCM + help + Select this to enable cpuidle for Qualcomm processors. + The Subsystem Power Manager (SPM) controls low power modes for the + CPU and L2 cores. It interface with various system drivers to put + the cores in low power modes. 
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile index 55a464f6a78b..f07800cbb43f 100644 --- a/drivers/cpuidle/Makefile +++ b/drivers/cpuidle/Makefile @@ -25,6 +25,7 @@ obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle_psci.o cpuidle_psci-y := cpuidle-psci.o cpuidle_psci-$(CONFIG_PM_GENERIC_DOMAINS_OF) += cpuidle-psci-domain.o obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o +obj-$(CONFIG_ARM_QCOM_SPM_CPUIDLE) += cpuidle-qcom-spm.o ############################################################################### # MIPS drivers diff --git a/drivers/soc/qcom/spm.c b/drivers/cpuidle/cpuidle-qcom-spm.c similarity index 75% rename from drivers/soc/qcom/spm.c rename to drivers/cpuidle/cpuidle-qcom-spm.c index 8e10e02c6aa5..adf91a6e4d7d 100644 --- a/drivers/soc/qcom/spm.c +++ b/drivers/cpuidle/cpuidle-qcom-spm.c @@ -19,10 +19,11 @@ #include #include -#include #include #include +#include "dt_idle_states.h" + #define MAX_PMIC_DATA 2 #define MAX_SEQ_DATA 64 #define SPM_CTL_INDEX 0x7f @@ -62,6 +63,7 @@ struct spm_reg_data { }; struct spm_driver_data { + struct cpuidle_driver cpuidle_driver; void __iomem *reg_base; const struct spm_reg_data *reg_data; }; @@ -107,11 +109,6 @@ static const struct spm_reg_data spm_reg_8064_cpu = { .start_index[PM_SLEEP_MODE_SPC] = 2, }; -static DEFINE_PER_CPU(struct spm_driver_data *, cpu_spm_drv); - -typedef int (*idle_fn)(void); -static DEFINE_PER_CPU(idle_fn*, qcom_idle_ops); - static inline void spm_register_write(struct spm_driver_data *drv, enum spm_reg reg, u32 val) { @@ -172,10 +169,9 @@ static int qcom_pm_collapse(unsigned long int unused) return -1; } -static int qcom_cpu_spc(void) +static int qcom_cpu_spc(struct spm_driver_data *drv) { int ret; - struct spm_driver_data *drv = __this_cpu_read(cpu_spm_drv); spm_set_low_power_mode(drv, PM_SLEEP_MODE_SPC); ret = cpu_suspend(0, qcom_pm_collapse); @@ -190,94 +186,49 @@ static int qcom_cpu_spc(void) return ret; } -static int qcom_idle_enter(unsigned long index) +static int spm_enter_idle_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv, int idx) { - return __this_cpu_read(qcom_idle_ops)[index](); + struct spm_driver_data *data = container_of(drv, struct spm_driver_data, + cpuidle_driver); + + return CPU_PM_CPU_IDLE_ENTER_PARAM(qcom_cpu_spc, idx, data); } -static const struct of_device_id qcom_idle_state_match[] __initconst = { - { .compatible = "qcom,idle-state-spc", .data = qcom_cpu_spc }, +static struct cpuidle_driver qcom_spm_idle_driver = { + .name = "qcom_spm", + .owner = THIS_MODULE, + .states[0] = { + .enter = spm_enter_idle_state, + .exit_latency = 1, + .target_residency = 1, + .power_usage = UINT_MAX, + .name = "WFI", + .desc = "ARM WFI", + } +}; + +static const struct of_device_id qcom_idle_state_match[] = { + { .compatible = "qcom,idle-state-spc", .data = spm_enter_idle_state }, { }, }; -static int __init qcom_cpuidle_init(struct device_node *cpu_node, int cpu) +static int spm_cpuidle_init(struct cpuidle_driver *drv, int cpu) { - const struct of_device_id *match_id; - struct device_node *state_node; - int i; - int state_count = 1; - idle_fn idle_fns[CPUIDLE_STATE_MAX]; - idle_fn *fns; - cpumask_t mask; - bool use_scm_power_down = false; + int ret; - if (!qcom_scm_is_available()) - return -EPROBE_DEFER; + memcpy(drv, &qcom_spm_idle_driver, sizeof(*drv)); + drv->cpumask = (struct cpumask *)cpumask_of(cpu); - for (i = 0; ; i++) { - state_node = of_parse_phandle(cpu_node, "cpu-idle-states", i); - if (!state_node) - break; + /* Parse idle states from device tree */ + ret = 
dt_init_idle_driver(drv, qcom_idle_state_match, 1); + if (ret <= 0) + return ret ? : -ENODEV; - if (!of_device_is_available(state_node)) - continue; - - if (i == CPUIDLE_STATE_MAX) { - pr_warn("%s: cpuidle states reached max possible\n", - __func__); - break; - } - - match_id = of_match_node(qcom_idle_state_match, state_node); - if (!match_id) - return -ENODEV; - - idle_fns[state_count] = match_id->data; - - /* Check if any of the states allow power down */ - if (match_id->data == qcom_cpu_spc) - use_scm_power_down = true; - - state_count++; - } - - if (state_count == 1) - goto check_spm; - - fns = devm_kcalloc(get_cpu_device(cpu), state_count, sizeof(*fns), - GFP_KERNEL); - if (!fns) - return -ENOMEM; - - for (i = 1; i < state_count; i++) - fns[i] = idle_fns[i]; - - if (use_scm_power_down) { - /* We have atleast one power down mode */ - cpumask_clear(&mask); - cpumask_set_cpu(cpu, &mask); - qcom_scm_set_warm_boot_addr(cpu_resume_arm, &mask); - } - - per_cpu(qcom_idle_ops, cpu) = fns; - - /* - * SPM probe for the cpu should have happened by now, if the - * SPM device does not exist, return -ENXIO to indicate that the - * cpu does not support idle states. - */ -check_spm: - return per_cpu(cpu_spm_drv, cpu) ? 0 : -ENXIO; + /* We have atleast one power down mode */ + return qcom_scm_set_warm_boot_addr(cpu_resume_arm, drv->cpumask); } -static const struct cpuidle_ops qcom_cpuidle_ops __initconst = { - .suspend = qcom_idle_enter, - .init = qcom_cpuidle_init, -}; - -CPUIDLE_METHOD_OF_DECLARE(qcom_idle_v1, "qcom,kpss-acc-v1", &qcom_cpuidle_ops); -CPUIDLE_METHOD_OF_DECLARE(qcom_idle_v2, "qcom,kpss-acc-v2", &qcom_cpuidle_ops); - static struct spm_driver_data *spm_get_drv(struct platform_device *pdev, int *spm_cpu) { @@ -323,11 +274,15 @@ static int spm_dev_probe(struct platform_device *pdev) struct resource *res; const struct of_device_id *match_id; void __iomem *addr; - int cpu; + int cpu, ret; + + if (!qcom_scm_is_available()) + return -EPROBE_DEFER; drv = spm_get_drv(pdev, &cpu); if (!drv) return -EINVAL; + platform_set_drvdata(pdev, drv); res = platform_get_resource(pdev, IORESOURCE_MEM, 0); drv->reg_base = devm_ioremap_resource(&pdev->dev, res); @@ -340,6 +295,10 @@ static int spm_dev_probe(struct platform_device *pdev) drv->reg_data = match_id->data; + ret = spm_cpuidle_init(&drv->cpuidle_driver, cpu); + if (ret) + return ret; + /* Write the SPM sequences first.. 
*/ addr = drv->reg_base + drv->reg_data->reg_offset[SPM_REG_SEQ_ENTRY]; __iowrite32_copy(addr, drv->reg_data->seq, @@ -362,13 +321,20 @@ static int spm_dev_probe(struct platform_device *pdev) /* Set up Standby as the default low power mode */ spm_set_low_power_mode(drv, PM_SLEEP_MODE_STBY); - per_cpu(cpu_spm_drv, cpu) = drv; + return cpuidle_register(&drv->cpuidle_driver, NULL); +} +static int spm_dev_remove(struct platform_device *pdev) +{ + struct spm_driver_data *drv = platform_get_drvdata(pdev); + + cpuidle_unregister(&drv->cpuidle_driver); return 0; } static struct platform_driver spm_driver = { .probe = spm_dev_probe, + .remove = spm_dev_remove, .driver = { .name = "saw", .of_match_table = spm_match_table, diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig index bf42a17a45de..285baa7e474e 100644 --- a/drivers/soc/qcom/Kconfig +++ b/drivers/soc/qcom/Kconfig @@ -80,16 +80,6 @@ config QCOM_PDR_HELPERS tristate select QCOM_QMI_HELPERS -config QCOM_PM - bool "Qualcomm Power Management" - depends on ARCH_QCOM && !ARM64 - select ARM_CPU_SUSPEND - select QCOM_SCM - help - QCOM Platform specific power driver to manage cores and L2 low power - modes. It interface with various system drivers to put the cores in - low power modes. - config QCOM_QMI_HELPERS tristate depends on NET diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile index 5d6b83dc58e8..92cc4232d72c 100644 --- a/drivers/soc/qcom/Makefile +++ b/drivers/soc/qcom/Makefile @@ -8,7 +8,6 @@ obj-$(CONFIG_QCOM_GSBI) += qcom_gsbi.o obj-$(CONFIG_QCOM_MDT_LOADER) += mdt_loader.o obj-$(CONFIG_QCOM_OCMEM) += ocmem.o obj-$(CONFIG_QCOM_PDR_HELPERS) += pdr_interface.o -obj-$(CONFIG_QCOM_PM) += spm.o obj-$(CONFIG_QCOM_QMI_HELPERS) += qmi_helpers.o qmi_helpers-y += qmi_encdec.o qmi_interface.o obj-$(CONFIG_QCOM_RMTFS_MEM) += rmtfs_mem.o From 64c7d7ea22d86cacb65d0c097cc447bc0e6d8abd Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 21 May 2020 19:08:09 +0200 Subject: [PATCH 41/51] PM: runtime: clk: Fix clk_pm_runtime_get() error path clk_pm_runtime_get() assumes that the PM-runtime usage counter will be dropped by pm_runtime_get_sync() on errors, which is not the case, so PM-runtime references to devices acquired by the former are leaked on errors returned by the latter. Fix this by modifying clk_pm_runtime_get() to drop the reference if pm_runtime_get_sync() returns an error. Fixes: 9a34b45397e5 clk: Add support for runtime PM Cc: 4.15+ # 4.15+ Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson --- drivers/clk/clk.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c index 2dfb30b963c4..407f6919604c 100644 --- a/drivers/clk/clk.c +++ b/drivers/clk/clk.c @@ -114,7 +114,11 @@ static int clk_pm_runtime_get(struct clk_core *core) return 0; ret = pm_runtime_get_sync(core->dev); - return ret < 0 ? ret : 0; + if (ret < 0) { + pm_runtime_put_noidle(core->dev); + return ret; + } + return 0; } static void clk_pm_runtime_put(struct clk_core *core) From ad1e4f74c072eaa2c6d77dd710db31aafecd614f Mon Sep 17 00:00:00 2001 From: Domenico Andreoli Date: Tue, 19 May 2020 20:14:10 +0200 Subject: [PATCH 42/51] PM: hibernate: Restrict writes to the resume device Hibernation via snapshot device requires write permission to the swap block device, the one that more often (but not necessarily) is used to store the hibernation image. 
With this patch, such permissions are granted iff: 1) snapshot device config option is enabled 2) swap partition is used as resume device In other circumstances the swap device is not writable from userspace. In order to achieve this, every write attempt to a swap device is checked against the device configured as part of the uswsusp API [0] using a pointer to the inode struct in memory. If the swap device being written was not configured for resuming, the write request is denied. NOTE: this implementation works only for swap block devices, where the inode configured by swapon (which sets S_SWAPFILE) is the same used by SNAPSHOT_SET_SWAP_AREA. In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode of the block device containing the filesystem where the swap file is located (+ offset in it) which is never passed to swapon and then has not set S_SWAPFILE. As result, the swap file itself (as a file) has never an option to be written from userspace. Instead it remains writable if accessed directly from the containing block device, which is always writeable from root. [0] Documentation/power/userland-swsusp.rst v2: - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev() - fix description so to correctly refer to the resume device Signed-off-by: Domenico Andreoli Acked-by: Darrick J. Wong Signed-off-by: Rafael J. Wysocki --- fs/block_dev.c | 3 +-- include/linux/suspend.h | 6 ++++++ kernel/power/user.c | 14 +++++++++++++- 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 93672c3f1c78..608a7b9173f4 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -2023,8 +2023,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from) if (bdev_read_only(I_BDEV(bd_inode))) return -EPERM; - /* uswsusp needs write permission to the swap */ - if (IS_SWAPFILE(bd_inode) && !hibernation_available()) + if (IS_SWAPFILE(bd_inode) && !is_hibernate_resume_dev(bd_inode)) return -ETXTBSY; if (!iov_iter_count(from)) diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 4fcc6fd0cbd6..b960098acfb0 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -466,6 +466,12 @@ static inline bool system_entering_hibernation(void) { return false; } static inline bool hibernation_available(void) { return false; } #endif /* CONFIG_HIBERNATION */ +#ifdef CONFIG_HIBERNATION_SNAPSHOT_DEV +int is_hibernate_resume_dev(const struct inode *); +#else +static inline int is_hibernate_resume_dev(const struct inode *i) { return 0; } +#endif + /* Hibernation and suspend events */ #define PM_HIBERNATION_PREPARE 0x0001 /* Going to hibernate */ #define PM_POST_HIBERNATION 0x0002 /* Hibernation finished */ diff --git a/kernel/power/user.c b/kernel/power/user.c index 98548d1cf8a6..d5eedc2baa2a 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -35,8 +35,14 @@ static struct snapshot_data { bool ready; bool platform_support; bool free_bitmaps; + struct inode *bd_inode; } snapshot_state; +int is_hibernate_resume_dev(const struct inode *bd_inode) +{ + return hibernation_available() && snapshot_state.bd_inode == bd_inode; +} + static int snapshot_open(struct inode *inode, struct file *filp) { struct snapshot_data *data; @@ -95,6 +101,7 @@ static int snapshot_open(struct inode *inode, struct file *filp) data->frozen = false; data->ready = false; data->platform_support = false; + data->bd_inode = NULL; Unlock: unlock_system_sleep(); @@ -110,6 +117,7 @@ static int snapshot_release(struct inode *inode, struct file *filp) swsusp_free(); 
data = filp->private_data; + data->bd_inode = NULL; free_all_swap_pages(data->swap); if (data->frozen) { pm_restore_gfp_mask(); @@ -202,6 +210,7 @@ struct compat_resume_swap_area { static int snapshot_set_swap_area(struct snapshot_data *data, void __user *argp) { + struct block_device *bdev; sector_t offset; dev_t swdev; @@ -232,9 +241,12 @@ static int snapshot_set_swap_area(struct snapshot_data *data, data->swap = -1; return -EINVAL; } - data->swap = swap_type_of(swdev, offset, NULL); + data->swap = swap_type_of(swdev, offset, &bdev); if (data->swap < 0) return -ENODEV; + + data->bd_inode = bdev->bd_inode; + bdput(bdev); return 0; } From d2216ba3ebea8d8864c5094526b8f9302c01021c Mon Sep 17 00:00:00 2001 From: Dmitry Osipenko Date: Fri, 3 Apr 2020 01:24:48 +0300 Subject: [PATCH 43/51] PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting We're taking into account both HW memory-accesses + CPU activity based on current CPU's frequency. For memory-accesses there is a kind of hysteresis in a form of "boosting" which is managed by the tegra30-devfreq driver. If current HW memory activity is higher than activity judged based of the CPU's frequency, then there is no need to schedule cpufreq_update_work because the result of the work will be a NO-OP. And thus, tegra_actmon_cpufreq_contribution() should return 0, meaning that at the moment CPU frequency doesn't contribute anything to the final decision about required memory clock rate. Signed-off-by: Dmitry Osipenko Signed-off-by: Chanwoo Choi --- drivers/devfreq/tegra30-devfreq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/devfreq/tegra30-devfreq.c b/drivers/devfreq/tegra30-devfreq.c index 28b2c7ca416e..dfc3ac93c584 100644 --- a/drivers/devfreq/tegra30-devfreq.c +++ b/drivers/devfreq/tegra30-devfreq.c @@ -420,7 +420,7 @@ tegra_actmon_cpufreq_contribution(struct tegra_devfreq *tegra, static_cpu_emc_freq = actmon_cpu_to_emc_rate(tegra, cpu_freq); - if (dev_freq >= static_cpu_emc_freq) + if (dev_freq + actmon_dev->boost_freq >= static_cpu_emc_freq) return 0; return static_cpu_emc_freq; From 0716f9fdb3b65b396531e490a42a4f58a69e1c8f Mon Sep 17 00:00:00 2001 From: Markus Elfring Date: Sat, 4 Apr 2020 20:34:02 +0200 Subject: [PATCH 44/51] PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The function “platform_get_irq” can log an error already. Thus omit a redundant message for the exception handling in the calling function. This issue was detected by using the Coccinelle software. 
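
For illustration, the calling convention this change relies on looks roughly like the sketch below; example_probe() is a made-up name, not a function from the driver:

#include <linux/platform_device.h>

/* Hypothetical probe function, for illustration only. */
static int example_probe(struct platform_device *pdev)
{
	int irq;

	irq = platform_get_irq(pdev, 0);
	if (irq < 0)
		return irq;	/* platform_get_irq() has already logged the failure */

	/* ... request the interrupt and finish probing ... */
	return 0;
}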
Signed-off-by: Markus Elfring Reviewed-by: Dmitry Osipenko Signed-off-by: Chanwoo Choi --- drivers/devfreq/tegra30-devfreq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/devfreq/tegra30-devfreq.c b/drivers/devfreq/tegra30-devfreq.c index dfc3ac93c584..e94a27804c20 100644 --- a/drivers/devfreq/tegra30-devfreq.c +++ b/drivers/devfreq/tegra30-devfreq.c @@ -807,10 +807,9 @@ static int tegra_devfreq_probe(struct platform_device *pdev) } err = platform_get_irq(pdev, 0); - if (err < 0) { - dev_err(&pdev->dev, "Failed to get IRQ: %d\n", err); + if (err < 0) return err; - } + tegra->irq = err; irq_set_status_flags(tegra->irq, IRQ_NOAUTOEN); From 5173a9756c8df9c387e04e49da0c4061951bbfec Mon Sep 17 00:00:00 2001 From: Leonard Crestez Date: Mon, 6 Apr 2020 15:03:07 +0300 Subject: [PATCH 45/51] PM / devfreq: Add generic imx bus scaling driver Add initial support for dynamic frequency switching on pieces of the imx interconnect fabric. All this driver does is set a clk rate based on an opp table, it does not map register areas. Signed-off-by: Leonard Crestez Tested-by: Martin Kepplinger Acked-by: Chanwoo Choi Signed-off-by: Chanwoo Choi --- drivers/devfreq/Kconfig | 8 +++ drivers/devfreq/Makefile | 1 + drivers/devfreq/imx-bus.c | 138 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 147 insertions(+) create mode 100644 drivers/devfreq/imx-bus.c diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig index 0b1df12e0f21..37dc40d1fcfb 100644 --- a/drivers/devfreq/Kconfig +++ b/drivers/devfreq/Kconfig @@ -91,6 +91,14 @@ config ARM_EXYNOS_BUS_DEVFREQ and adjusts the operating frequencies and voltages with OPP support. This does not yet operate with optimal voltages. +config ARM_IMX_BUS_DEVFREQ + tristate "i.MX Generic Bus DEVFREQ Driver" + depends on ARCH_MXC || COMPILE_TEST + select DEVFREQ_GOV_USERSPACE + help + This adds the generic DEVFREQ driver for i.MX interconnects. It + allows adjusting NIC/NOC frequency. 
+ config ARM_IMX8M_DDRC_DEVFREQ tristate "i.MX8M DDRC DEVFREQ Driver" depends on (ARCH_MXC && HAVE_ARM_SMCCC) || \ diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile index 3eb4d5e6635c..3ca1ad0ecb97 100644 --- a/drivers/devfreq/Makefile +++ b/drivers/devfreq/Makefile @@ -9,6 +9,7 @@ obj-$(CONFIG_DEVFREQ_GOV_PASSIVE) += governor_passive.o # DEVFREQ Drivers obj-$(CONFIG_ARM_EXYNOS_BUS_DEVFREQ) += exynos-bus.o +obj-$(CONFIG_ARM_IMX_BUS_DEVFREQ) += imx-bus.o obj-$(CONFIG_ARM_IMX8M_DDRC_DEVFREQ) += imx8m-ddrc.o obj-$(CONFIG_ARM_RK3399_DMC_DEVFREQ) += rk3399_dmc.o obj-$(CONFIG_ARM_TEGRA_DEVFREQ) += tegra30-devfreq.o diff --git a/drivers/devfreq/imx-bus.c b/drivers/devfreq/imx-bus.c new file mode 100644 index 000000000000..428f7980a2f2 --- /dev/null +++ b/drivers/devfreq/imx-bus.c @@ -0,0 +1,138 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 2019 NXP + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +struct imx_bus { + struct devfreq_dev_profile profile; + struct devfreq *devfreq; + struct clk *clk; +}; + +static int imx_bus_target(struct device *dev, + unsigned long *freq, u32 flags) +{ + struct dev_pm_opp *new_opp; + int ret; + + new_opp = devfreq_recommended_opp(dev, freq, flags); + if (IS_ERR(new_opp)) { + ret = PTR_ERR(new_opp); + dev_err(dev, "failed to get recommended opp: %d\n", ret); + return ret; + } + dev_pm_opp_put(new_opp); + + return dev_pm_opp_set_rate(dev, *freq); +} + +static int imx_bus_get_cur_freq(struct device *dev, unsigned long *freq) +{ + struct imx_bus *priv = dev_get_drvdata(dev); + + *freq = clk_get_rate(priv->clk); + + return 0; +} + +static int imx_bus_get_dev_status(struct device *dev, + struct devfreq_dev_status *stat) +{ + struct imx_bus *priv = dev_get_drvdata(dev); + + stat->busy_time = 0; + stat->total_time = 0; + stat->current_frequency = clk_get_rate(priv->clk); + + return 0; +} + +static void imx_bus_exit(struct device *dev) +{ + dev_pm_opp_of_remove_table(dev); +} + +static int imx_bus_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct imx_bus *priv; + const char *gov = DEVFREQ_GOV_USERSPACE; + int ret; + + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); + if (!priv) + return -ENOMEM; + + /* + * Fetch the clock to adjust but don't explicitly enable. + * + * For imx bus clock clk_set_rate is safe no matter if the clock is on + * or off and some peripheral side-buses might be off unless enabled by + * drivers for devices on those specific buses. + * + * Rate adjustment on a disabled bus clock just takes effect later. 
+ */ + priv->clk = devm_clk_get(dev, NULL); + if (IS_ERR(priv->clk)) { + ret = PTR_ERR(priv->clk); + dev_err(dev, "failed to fetch clk: %d\n", ret); + return ret; + } + platform_set_drvdata(pdev, priv); + + ret = dev_pm_opp_of_add_table(dev); + if (ret < 0) { + dev_err(dev, "failed to get OPP table\n"); + return ret; + } + + priv->profile.polling_ms = 1000; + priv->profile.target = imx_bus_target; + priv->profile.get_dev_status = imx_bus_get_dev_status; + priv->profile.exit = imx_bus_exit; + priv->profile.get_cur_freq = imx_bus_get_cur_freq; + priv->profile.initial_freq = clk_get_rate(priv->clk); + + priv->devfreq = devm_devfreq_add_device(dev, &priv->profile, + gov, NULL); + if (IS_ERR(priv->devfreq)) { + ret = PTR_ERR(priv->devfreq); + dev_err(dev, "failed to add devfreq device: %d\n", ret); + goto err; + } + + return 0; + +err: + dev_pm_opp_of_remove_table(dev); + return ret; +} + +static const struct of_device_id imx_bus_of_match[] = { + { .compatible = "fsl,imx8m-noc", }, + { .compatible = "fsl,imx8m-nic", }, + { /* sentinel */ }, +}; +MODULE_DEVICE_TABLE(of, imx_bus_of_match); + +static struct platform_driver imx_bus_platdrv = { + .probe = imx_bus_probe, + .driver = { + .name = "imx-bus-devfreq", + .of_match_table = of_match_ptr(imx_bus_of_match), + }, +}; +module_platform_driver(imx_bus_platdrv); + +MODULE_DESCRIPTION("Generic i.MX bus frequency scaling driver"); +MODULE_AUTHOR("Leonard Crestez "); +MODULE_LICENSE("GPL v2"); From 02355216b4c00b1fe3f1e969aa15eca99240f8f3 Mon Sep 17 00:00:00 2001 From: Leonard Crestez Date: Mon, 6 Apr 2020 15:03:08 +0300 Subject: [PATCH 46/51] PM / devfreq: imx: Register interconnect device There is no single device which can represent the imx interconnect. Instead of adding a virtual one just make the main &noc act as the global interconnect provider. The imx interconnect provider driver will scale the NOC and DDRC based on bandwidth request. More scalable nodes can be added in the future, for example for audio/display/vpu/gpu NICs. 
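
For illustration, a consumer of such a provider would request bandwidth through the generic interconnect API roughly as sketched below; this is only a sketch, and the "dram" path name and the bandwidth numbers are made-up example values rather than anything defined by this patch:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/interconnect.h>

/* Hypothetical consumer helper, for illustration only. */
static int example_request_dram_bandwidth(struct device *dev)
{
	struct icc_path *path;
	int ret;

	/* "dram" is an assumed interconnect-names entry in the consumer node. */
	path = devm_of_icc_get(dev, "dram");
	if (IS_ERR(path))
		return PTR_ERR(path);

	/* Request 1000 MB/s average and 2000 MB/s peak bandwidth. */
	ret = icc_set_bw(path, MBps_to_icc(1000), MBps_to_icc(2000));
	if (ret)
		dev_err(dev, "failed to set interconnect bandwidth: %d\n", ret);

	return ret;
}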
Signed-off-by: Leonard Crestez Tested-by: Martin Kepplinger Acked-by: Chanwoo Choi Signed-off-by: Chanwoo Choi --- drivers/devfreq/imx-bus.c | 41 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/drivers/devfreq/imx-bus.c b/drivers/devfreq/imx-bus.c index 428f7980a2f2..532e7954032f 100644 --- a/drivers/devfreq/imx-bus.c +++ b/drivers/devfreq/imx-bus.c @@ -16,6 +16,7 @@ struct imx_bus { struct devfreq_dev_profile profile; struct devfreq *devfreq; struct clk *clk; + struct platform_device *icc_pdev; }; static int imx_bus_target(struct device *dev, @@ -58,7 +59,40 @@ static int imx_bus_get_dev_status(struct device *dev, static void imx_bus_exit(struct device *dev) { + struct imx_bus *priv = dev_get_drvdata(dev); + dev_pm_opp_of_remove_table(dev); + platform_device_unregister(priv->icc_pdev); +} + +/* imx_bus_init_icc() - register matching icc provider if required */ +static int imx_bus_init_icc(struct device *dev) +{ + struct imx_bus *priv = dev_get_drvdata(dev); + const char *icc_driver_name; + + if (!of_get_property(dev->of_node, "#interconnect-cells", 0)) + return 0; + if (!IS_ENABLED(CONFIG_INTERCONNECT_IMX)) { + dev_warn(dev, "imx interconnect drivers disabled\n"); + return 0; + } + + icc_driver_name = of_device_get_match_data(dev); + if (!icc_driver_name) { + dev_err(dev, "unknown interconnect driver\n"); + return 0; + } + + priv->icc_pdev = platform_device_register_data( + dev, icc_driver_name, -1, NULL, 0); + if (IS_ERR(priv->icc_pdev)) { + dev_err(dev, "failed to register icc provider %s: %ld\n", + icc_driver_name, PTR_ERR(priv->devfreq)); + return PTR_ERR(priv->devfreq); + } + + return 0; } static int imx_bus_probe(struct platform_device *pdev) @@ -110,6 +144,10 @@ static int imx_bus_probe(struct platform_device *pdev) goto err; } + ret = imx_bus_init_icc(dev); + if (ret) + goto err; + return 0; err: @@ -118,6 +156,9 @@ err: } static const struct of_device_id imx_bus_of_match[] = { + { .compatible = "fsl,imx8mq-noc", .data = "imx8mq-interconnect", }, + { .compatible = "fsl,imx8mm-noc", .data = "imx8mm-interconnect", }, + { .compatible = "fsl,imx8mn-noc", .data = "imx8mn-interconnect", }, { .compatible = "fsl,imx8m-noc", }, { .compatible = "fsl,imx8m-nic", }, { /* sentinel */ }, From a316b5ca9ead8065ac873e655c313248ecd732bc Mon Sep 17 00:00:00 2001 From: Dmitry Osipenko Date: Thu, 27 Feb 2020 20:08:54 +0300 Subject: [PATCH 47/51] PM / devfreq: Replace strncpy with strscpy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit GCC produces this warning when kernel compiled using `make W=1`: warning: ‘strncpy’ specified bound 16 equals destination size [-Wstringop-truncation] 772 | strncpy(devfreq->governor_name, governor_name, DEVFREQ_NAME_LEN); The strncpy doesn't take care of NULL-termination of the destination buffer, while the strscpy does. 
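
For illustration, a minimal sketch of the safer copy; the example_* names are made up, and EXAMPLE_NAME_LEN stands in for the 16-byte DEVFREQ_NAME_LEN:

#include <linux/errno.h>
#include <linux/printk.h>
#include <linux/string.h>

#define EXAMPLE_NAME_LEN 16	/* same size as DEVFREQ_NAME_LEN */

/* Hypothetical helper: copy a governor name into a fixed-size buffer. */
static void example_set_name(char *dst, const char *src)
{
	/*
	 * strncpy(dst, src, EXAMPLE_NAME_LEN) would leave "dst" without a
	 * terminating NUL whenever strlen(src) >= EXAMPLE_NAME_LEN.
	 * strscpy() always NUL-terminates and reports truncation instead.
	 */
	if (strscpy(dst, src, EXAMPLE_NAME_LEN) == -E2BIG)
		pr_warn("name \"%s\" was truncated\n", src);
}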
Signed-off-by: Dmitry Osipenko Signed-off-by: Chanwoo Choi --- drivers/devfreq/devfreq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c index 6fecd11dafdd..ef3d2bc3d1ac 100644 --- a/drivers/devfreq/devfreq.c +++ b/drivers/devfreq/devfreq.c @@ -768,7 +768,7 @@ struct devfreq *devfreq_add_device(struct device *dev, devfreq->dev.release = devfreq_dev_release; INIT_LIST_HEAD(&devfreq->node); devfreq->profile = profile; - strncpy(devfreq->governor_name, governor_name, DEVFREQ_NAME_LEN); + strscpy(devfreq->governor_name, governor_name, DEVFREQ_NAME_LEN); devfreq->previous_freq = profile->initial_freq; devfreq->last_status.current_frequency = profile->initial_freq; devfreq->data = data; From 48bbf6375131155d329eb8e06ae962e27cbba032 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 7 May 2020 08:12:45 -0500 Subject: [PATCH 48/51] PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR Fix inconsistent IS_ERR and PTR_ERR in imx_bus_init_icc(). The proper pointer to be passed as argument to PTR_ERR() is priv->icc_pdev. This bug was detected with the help of Coccinelle. Fixes: 16c1d2f1b0bd ("PM / devfreq: imx: Register interconnect device") Signed-off-by: Gustavo A. R. Silva Reviewed-by: Dong Aisheng [cw00.choi: Edit the patch title from 'imx' to 'imx-bus'] Signed-off-by: Chanwoo Choi --- drivers/devfreq/imx-bus.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/devfreq/imx-bus.c b/drivers/devfreq/imx-bus.c index 532e7954032f..4f38455ad742 100644 --- a/drivers/devfreq/imx-bus.c +++ b/drivers/devfreq/imx-bus.c @@ -88,8 +88,8 @@ static int imx_bus_init_icc(struct device *dev) dev, icc_driver_name, -1, NULL, 0); if (IS_ERR(priv->icc_pdev)) { dev_err(dev, "failed to register icc provider %s: %ld\n", - icc_driver_name, PTR_ERR(priv->devfreq)); - return PTR_ERR(priv->devfreq); + icc_driver_name, PTR_ERR(priv->icc_pdev)); + return PTR_ERR(priv->icc_pdev); } return 0; From 8fc0e48e0faefef5064f3cb803d3d12314e16ec4 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Tue, 12 May 2020 08:41:58 +0200 Subject: [PATCH 49/51] PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex Instead of warning when mutex_is_locked(), just use the lockdep framework. The code is smaller and checks could be disabled for production environments (it is useful only during development). Put asserts at beginning of function, even before validating arguments. The behavior of update_devfreq() is now changed because lockdep assert will only print a warning, not return with EINVAL. 
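
For illustration, the resulting usage pattern is roughly the sketch below; the example_* names are made up and not taken from devfreq:

#include <linux/lockdep.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_lock);
static int example_state;

/* Hypothetical helper that must be called with example_lock held. */
static void example_update(int val)
{
	/* Compiles to a no-op without lockdep; warns but does not return an error. */
	lockdep_assert_held(&example_lock);

	example_state = val;
}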
Signed-off-by: Krzysztof Kozlowski Signed-off-by: Chanwoo Choi --- drivers/devfreq/devfreq.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c index ef3d2bc3d1ac..52b9c3e141f3 100644 --- a/drivers/devfreq/devfreq.c +++ b/drivers/devfreq/devfreq.c @@ -60,12 +60,12 @@ static struct devfreq *find_device_devfreq(struct device *dev) { struct devfreq *tmp_devfreq; + lockdep_assert_held(&devfreq_list_lock); + if (IS_ERR_OR_NULL(dev)) { pr_err("DEVFREQ: %s: Invalid parameters\n", __func__); return ERR_PTR(-EINVAL); } - WARN(!mutex_is_locked(&devfreq_list_lock), - "devfreq_list_lock must be locked."); list_for_each_entry(tmp_devfreq, &devfreq_list, node) { if (tmp_devfreq->dev.parent == dev) @@ -258,12 +258,12 @@ static struct devfreq_governor *find_devfreq_governor(const char *name) { struct devfreq_governor *tmp_governor; + lockdep_assert_held(&devfreq_list_lock); + if (IS_ERR_OR_NULL(name)) { pr_err("DEVFREQ: %s: Invalid parameters\n", __func__); return ERR_PTR(-EINVAL); } - WARN(!mutex_is_locked(&devfreq_list_lock), - "devfreq_list_lock must be locked."); list_for_each_entry(tmp_governor, &devfreq_governor_list, node) { if (!strncmp(tmp_governor->name, name, DEVFREQ_NAME_LEN)) @@ -289,12 +289,12 @@ static struct devfreq_governor *try_then_request_governor(const char *name) struct devfreq_governor *governor; int err = 0; + lockdep_assert_held(&devfreq_list_lock); + if (IS_ERR_OR_NULL(name)) { pr_err("DEVFREQ: %s: Invalid parameters\n", __func__); return ERR_PTR(-EINVAL); } - WARN(!mutex_is_locked(&devfreq_list_lock), - "devfreq_list_lock must be locked."); governor = find_devfreq_governor(name); if (IS_ERR(governor)) { @@ -392,10 +392,7 @@ int update_devfreq(struct devfreq *devfreq) int err = 0; u32 flags = 0; - if (!mutex_is_locked(&devfreq->lock)) { - WARN(true, "devfreq->lock must be locked by the caller.\n"); - return -EINVAL; - } + lockdep_assert_held(&devfreq->lock); if (!devfreq->governor) return -EINVAL; From 9a7875461fd0427dc86e3a87e93bd5723679b8b1 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 28 May 2020 16:45:14 +0200 Subject: [PATCH 50/51] PM: runtime: Replace pm_runtime_callbacks_present() The name of pm_runtime_callbacks_present() is confusing, because it suggests that the device has PM-runtime callbacks if 'true' is returned by that function, but in fact that may not be the case, so replace it with pm_runtime_has_no_callbacks() which is not ambiguous. No functional impact. Signed-off-by: Rafael J. 
Wysocki Reviewed-by: Ulf Hansson --- drivers/base/power/sysfs.c | 4 ++-- include/linux/pm_runtime.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c index 2b99fe1eb207..24d25cf8ab14 100644 --- a/drivers/base/power/sysfs.c +++ b/drivers/base/power/sysfs.c @@ -666,7 +666,7 @@ int dpm_sysfs_add(struct device *dev) if (rc) return rc; - if (pm_runtime_callbacks_present(dev)) { + if (!pm_runtime_has_no_callbacks(dev)) { rc = sysfs_merge_group(&dev->kobj, &pm_runtime_attr_group); if (rc) goto err_out; @@ -709,7 +709,7 @@ int dpm_sysfs_change_owner(struct device *dev, kuid_t kuid, kgid_t kgid) if (rc) return rc; - if (pm_runtime_callbacks_present(dev)) { + if (!pm_runtime_has_no_callbacks(dev)) { rc = sysfs_group_change_owner( &dev->kobj, &pm_runtime_attr_group, kuid, kgid); if (rc) diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h index 3bdcbce8141a..3dbc207bff53 100644 --- a/include/linux/pm_runtime.h +++ b/include/linux/pm_runtime.h @@ -102,9 +102,9 @@ static inline bool pm_runtime_enabled(struct device *dev) return !dev->power.disable_depth; } -static inline bool pm_runtime_callbacks_present(struct device *dev) +static inline bool pm_runtime_has_no_callbacks(struct device *dev) { - return !dev->power.no_callbacks; + return dev->power.no_callbacks; } static inline void pm_runtime_mark_last_busy(struct device *dev) From c343bf1ba5efcbf2266a1fe3baefec9cc82f867f Mon Sep 17 00:00:00 2001 From: Qiushi Wu Date: Thu, 28 May 2020 13:20:46 -0500 Subject: [PATCH 51/51] cpuidle: Fix three reference count leaks kobject_init_and_add() takes reference even when it fails. If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Previous commit "b8eb718348b8" fixed a similar problem. Signed-off-by: Qiushi Wu [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/sysfs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index 14c0eb536787..091d1caceb41 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -484,7 +484,7 @@ static int cpuidle_add_state_sysfs(struct cpuidle_device *device) ret = kobject_init_and_add(&kobj->kobj, &ktype_state_cpuidle, &kdev->kobj, "state%d", i); if (ret) { - kfree(kobj); + kobject_put(&kobj->kobj); goto error_state; } cpuidle_add_s2idle_attr_group(kobj); @@ -615,7 +615,7 @@ static int cpuidle_add_driver_sysfs(struct cpuidle_device *dev) ret = kobject_init_and_add(&kdrv->kobj, &ktype_driver_cpuidle, &kdev->kobj, "driver"); if (ret) { - kfree(kdrv); + kobject_put(&kdrv->kobj); return ret; } @@ -709,7 +709,7 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev) error = kobject_init_and_add(&kdev->kobj, &ktype_cpuidle, &cpu_dev->kobj, "cpuidle"); if (error) { - kfree(kdev); + kobject_put(&kdev->kobj); return error; }
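
For illustration, the error-handling rule applied by this fix can be sketched as follows; the example_* names are made up and not taken from cpuidle:

#include <linux/errno.h>
#include <linux/kobject.h>
#include <linux/slab.h>

/* Hypothetical object embedding a kobject, for illustration only. */
struct example_obj {
	struct kobject kobj;
};

static void example_release(struct kobject *kobj)
{
	kfree(container_of(kobj, struct example_obj, kobj));
}

static struct kobj_type example_ktype = {
	.release	= example_release,
	.sysfs_ops	= &kobj_sysfs_ops,
};

static int example_add(struct kobject *parent)
{
	struct example_obj *obj;
	int ret;

	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
	if (!obj)
		return -ENOMEM;

	ret = kobject_init_and_add(&obj->kobj, &example_ktype, parent, "example");
	if (ret)
		/* Not kfree(obj): drop the reference so ->release() frees it. */
		kobject_put(&obj->kobj);

	return ret;
}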