linux/drivers/thermal
Daniel Lezcano 07209fcf33 thermal/drivers/step_wise: Fix temperature regulation misbehavior
When the cooling device is cpufreq and heat dissipation is not efficient
enough, the temperature can increase little by little until it reaches the
critical threshold, leading to a SoC reset.

The behavior is reproducible on a hikey6220 with bad heat dissipation (e.g.
stacked with other boards).

Running a simple C program doing while(1); on each CPU of the SoC makes the
temperature reach the passive regulation trip point and eventually the maximum
allowed temperature, followed by a reset.

This issue has also been reported when running the libhugetlbfs test suite.

What is observed is a ping-pong between two cpu frequencies, 1.2GHz and 900MHz,
while the temperature continues to grow.

It appears the step wise governor first calls get_target_state() with the
throttle set to true and the trend set to 'raising'. The code logically selects
the next state, so the cpu frequency decreases from 1.2GHz to 900MHz. So far so
good. The temperature decreases immediately but still stays above the trip
point, so get_target_state() is called again, this time with the throttle set
to true *and* the trend set to 'dropping'. From there the algorithm assumes we
have to step down the state, and the cpu frequency jumps back to 1.2GHz. But
the temperature is still higher than the trip point, so get_target_state() is
called with throttle=1 and trend='raising' again and we jump to 900MHz, then
with throttle=1 and trend='dropping' and we jump back to 1.2GHz, and so on. The
temperature does not stabilize and continues to increase.

[  237.922654] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
[  237.922678] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
[  237.922690] thermal cooling_device0: cur_state=0
[  237.922701] thermal cooling_device0: old_target=0, target=1
[  238.026656] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
[  238.026680] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=1
[  238.026694] thermal cooling_device0: cur_state=1
[  238.026707] thermal cooling_device0: old_target=1, target=0
[  238.134647] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
[  238.134667] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
[  238.134679] thermal cooling_device0: cur_state=0
[  238.134690] thermal cooling_device0: old_target=0, target=1

In this situation the temperature continues to increase while the trend
oscillates between 'dropping' and 'raising'. We need to keep the current state
untouched if the throttle is set, so that either the temperature can decrease
or a higher state can be selected, thus preventing this oscillation.

Keeping the next_target untouched when 'throttle' is true at 'dropping' time
fixes the issue.

The following traces show the governor does not change the next state if
trend==2 (dropping) and throttle==1.

[ 2306.127987] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
[ 2306.128009] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
[ 2306.128021] thermal cooling_device0: cur_state=0
[ 2306.128031] thermal cooling_device0: old_target=0, target=1
[ 2306.231991] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
[ 2306.232016] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=1
[ 2306.232030] thermal cooling_device0: cur_state=1
[ 2306.232042] thermal cooling_device0: old_target=1, target=1
[ 2306.335982] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
[ 2306.336006] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=1
[ 2306.336021] thermal cooling_device0: cur_state=1
[ 2306.336034] thermal cooling_device0: old_target=1, target=1
[ 2306.439984] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
[ 2306.440008] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=0
[ 2306.440022] thermal cooling_device0: cur_state=1
[ 2306.440034] thermal cooling_device0: old_target=1, target=0

[ ... ]

After a while, if the temperature continues to increase, the next state becomes
2 which is 720MHz on the hikey. That results in the temperature stabilizing
around the trip point.

[ 2455.831982] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
[ 2455.832006] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=0
[ 2455.832019] thermal cooling_device0: cur_state=1
[ 2455.832032] thermal cooling_device0: old_target=1, target=1
[ 2455.935985] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
[ 2455.936013] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=0
[ 2455.936027] thermal cooling_device0: cur_state=1
[ 2455.936040] thermal cooling_device0: old_target=1, target=1
[ 2456.043984] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=0,throttle=1
[ 2456.044009] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=0,throttle=0
[ 2456.044023] thermal cooling_device0: cur_state=1
[ 2456.044036] thermal cooling_device0: old_target=1, target=1
[ 2456.148001] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=1,throttle=1
[ 2456.148028] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=1,throttle=1
[ 2456.148042] thermal cooling_device0: cur_state=1
[ 2456.148055] thermal cooling_device0: old_target=1, target=2
[ 2456.252009] thermal thermal_zone0: Trip0[type=1,temp=65000]:trend=2,throttle=1
[ 2456.252041] thermal thermal_zone0: Trip1[type=1,temp=75000]:trend=2,throttle=0
[ 2456.252058] thermal cooling_device0: cur_state=2
[ 2456.252075] thermal cooling_device0: old_target=2, target=1

In other words, this change is needed to keep the state of a cooling device
when the temperature trend oscillates while the temperature increases slightly.

Without this change, the situation above leads to a catastrophic crash by a
hardware reset on hikey. This issue has also been reported on an OMAP dra7xx.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Keerthy <j-keerthy@ti.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Tested-by: Keerthy <j-keerthy@ti.com>
Reviewed-by: Keerthy <j-keerthy@ti.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2017-10-31 19:32:17 -07:00
broadcom thermal: bcm2835: constify thermal_zone_of_device_ops structures 2017-08-11 11:38:30 +08:00
int340x_thermal Thermal: int3406_thermal: fix thermal sysfs I/F 2017-09-01 09:03:29 +08:00
qcom thermal: qcom: tsens: Fix return value check in init_common() 2016-09-27 14:02:16 +08:00
samsung thermal: exynos: constify thermal_zone_of_device_ops structures 2017-08-11 11:38:30 +08:00
st thermal: Enhance thermal_zone_device_update for events 2016-09-27 14:35:21 +08:00
tegra thermal: tegra: remove null check for dev pointer 2017-10-31 19:32:13 -07:00
ti-soc-thermal thermal: ti-soc-thermal: Fix ti_thermal_unregister_cpu_cooling NULL pointer on unload 2017-10-31 19:32:14 -07:00
armada_thermal.c thermal: armada: fix formula documentation comment 2017-10-31 19:32:13 -07:00
clock_cooling.c thermal: convert clock cooling to use an IDA 2017-01-04 12:47:28 +08:00
cpu_cooling.c thermal: cpu_cooling: Replace kmalloc with kmalloc_array 2017-05-27 17:33:04 -07:00
da9062-thermal.c thermal: da9062/61: Thermal junction temperature monitoring driver 2017-04-06 21:48:03 -07:00
db8500_thermal.c thermal: db8500: Fix module autoload 2016-11-23 10:07:35 +08:00
devfreq_cooling.c trace: thermal: add another parameter 'power' to the tracing function 2017-05-05 15:54:45 +08:00
dove_thermal.c thermal: consistently use int for temperatures 2015-08-03 23:15:50 +08:00
fair_share.c thermal: fix source code documentation for parameters 2017-06-29 10:46:31 +08:00
gov_bang_bang.c thermal: bang-bang governor: act on lower trip boundary 2016-09-27 14:02:16 +08:00
hisi_thermal.c thermal/drivers/hisi: Use round up step value 2017-10-31 19:32:17 -07:00
imx_thermal.c thermal: imx: Handle return value of clk_prepare_enable 2017-06-30 16:41:55 -07:00
intel_bxt_pmic_thermal.c mfd: intel_soc_pmic_bxtwc: Remove thermal second level IRQs 2017-06-19 15:44:29 +01:00
intel_pch_thermal.c thermal: intel_pch_thermal: Fix enable check on Broadwell-DE 2017-08-15 14:32:58 +08:00
intel_powerclamp.c sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h> 2017-03-02 08:42:27 +01:00
intel_quark_dts_thermal.c x86/platform/iosf_mbi: Remove duplicate definitions 2015-12-09 01:18:34 +01:00
intel_soc_dts_iosf.c thermal: Enhance thermal_zone_device_update for events 2016-09-27 14:35:21 +08:00
intel_soc_dts_iosf.h Thermal: Intel SoC: DTS thermal IOSF core 2015-05-01 11:20:42 +08:00
intel_soc_dts_thermal.c Thermal: Intel SoC DTS: Change interrupt request behavior 2017-05-05 16:00:10 +08:00
Kconfig thermal: enable broadcom menu for arm64 bcm2835 2017-10-31 19:32:13 -07:00
kirkwood_thermal.c thermal: consistently use int for temperatures 2015-08-03 23:15:50 +08:00
Makefile thermal: uniphier: add UniPhier thermal driver 2017-08-11 10:49:52 +08:00
max77620_thermal.c thermal: max77620: fix pinmux conflict on reprobe 2017-06-13 11:07:32 +02:00
mtk_thermal.c thermal: mediatek: minor mtk_thermal.c cleanups 2017-08-31 21:13:53 +08:00
of-thermal.c thermal: Enhance thermal_zone_device_update for events 2016-09-27 14:35:21 +08:00
power_allocator.c thermal: fix race condition when updating cooling device 2016-08-08 10:57:39 +08:00
qcom-spmi-temp-alarm.c thermal: qcom-spmi: Treat reg property as a single cell 2016-11-23 10:07:35 +08:00
qoriq_thermal.c thermal: qoriq: constify thermal_zone_of_device_ops structures 2017-08-11 11:38:29 +08:00
rcar_gen3_thermal.c thermal: rcar_gen3_thermal: fix initialization sequence for H3 ES2.0 2017-10-31 19:32:14 -07:00
rcar_thermal.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2016-10-12 11:05:23 -07:00
rockchip_thermal.c thermal: rockchip: Support the RV1108 SoC in thermal driver 2017-10-31 19:32:13 -07:00
spear_thermal.c thermal: spear: use __maybe_unused for PM functions 2016-02-09 14:12:08 -08:00
step_wise.c thermal/drivers/step_wise: Fix temperature regulation misbehavior 2017-10-31 19:32:17 -07:00
tango_thermal.c thermal: tango: Fix module autoload 2016-11-23 10:07:35 +08:00
thermal_core.c thermal: core: Fix resources release in error paths in thermal_zone_device_register() 2017-08-11 11:34:07 +08:00
thermal_core.h thermal: core: Add some new helper functions to free resources 2017-08-11 11:33:49 +08:00
thermal_helpers.c thermal: core: move slop and offset helpers to thermal_helpers.c 2016-11-23 10:06:12 +08:00
thermal_hwmon.c Revert "thermal: thermal_hwmon: Convert to hwmon_device_register_with_info()" 2017-01-25 09:51:08 +08:00
thermal_hwmon.h thermal: hwmon: move hwmon support to single file 2013-09-03 09:09:12 -04:00
thermal_sysfs.c thermal: core: Add some new helper functions to free resources 2017-08-11 11:33:49 +08:00
thermal-generic-adc.c thermal: generic-adc: Add ADC based thermal sensor driver 2016-05-17 07:28:31 -07:00
uniphier_thermal.c thermal: uniphier: add UniPhier thermal driver 2017-08-11 10:49:52 +08:00
user_space.c thermal: fix source code documentation for parameters 2017-06-29 10:46:31 +08:00
x86_pkg_temp_thermal.c thermal/x86 pkg temp: Convert to hotplug state machine 2016-11-30 10:25:47 +08:00
zx2967_thermal.c thermal: zx2967: constify thermal_zone_of_device_ops structures 2017-08-11 11:38:30 +08:00