linux/arch/powerpc/kernel/idle_book3s.S

1003 lines
26 KiB
ArmAsm
Raw Normal View History

/*
* This file contains idle entry/exit functions for POWER7,
* POWER8 and POWER9 CPUs.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
#include <linux/threads.h>
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/cputable.h>
#include <asm/thread_info.h>
#include <asm/ppc_asm.h>
#include <asm/asm-offsets.h>
#include <asm/ppc-opcode.h>
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 15:27:59 +08:00
#include <asm/hw_irq.h>
#include <asm/kvm_book3s_asm.h>
#include <asm/opal.h>
#include <asm/cpuidle.h>
#include <asm/exception-64s.h>
#include <asm/book3s/64/mmu-hash.h>
#include <asm/mmu.h>
#undef DEBUG
/*
* Use unused space in the interrupt stack to save and restore
* registers for winkle support.
*/
#define _MMCR0 GPR0
#define _SDR1 GPR3
#define _PTCR GPR3
#define _RPR GPR4
#define _SPURR GPR5
#define _PURR GPR6
#define _TSCR GPR7
#define _DSCR GPR8
#define _AMOR GPR9
#define _WORT GPR10
#define _WORC GPR11
#define _LPCR GPR12
powernv: Pass PSSCR value and mask to power9_idle_stop The power9_idle_stop method currently takes only the requested stop level as a parameter and picks up the rest of the PSSCR bits from a hand-coded macro. This is not a very flexible design, especially when the firmware has the capability to communicate the psscr value and the mask associated with a particular stop state via device tree. This patch modifies the power9_idle_stop API to take as parameters the PSSCR value and the PSSCR mask corresponding to the stop state that needs to be set. These PSSCR value and mask are respectively obtained by parsing the "ibm,cpu-idle-state-psscr" and "ibm,cpu-idle-state-psscr-mask" fields from the device tree. In addition to this, the patch adds support for handling stop states for which ESL and EC bits in the PSSCR are zero. As per the architecture, a wakeup from these stop states resumes execution from the subsequent instruction as opposed to waking up at the System Vector. The older firmware sets only the Requested Level (RL) field in the psscr and psscr-mask exposed in the device tree. For older firmware where psscr-mask=0xf, this patch will set the default sane values that the set for for remaining PSSCR fields (i.e PSLL, MTL, ESL, EC, and TR). For the new firmware, the patch will validate that the invariants required by the ISA for the psscr values are maintained by the firmware. This skiboot patch that exports fully populated PSSCR values and the mask for all the stop states can be found here: https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html [Optimize the number of instructions before entering STOP with ESL=EC=0, validate the PSSCR values provided by the firimware maintains the invariants required as per the ISA suggested by Balbir Singh] Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 16:36:28 +08:00
#define PSSCR_EC_ESL_MASK_SHIFTED (PSSCR_EC | PSSCR_ESL) >> 16
.text
/*
* Used by threads before entering deep idle states. Saves SPRs
* in interrupt stack frame
*/
save_sprs_to_stack:
/*
* Note all register i.e per-core, per-subcore or per-thread is saved
* here since any thread in the core might wake up first
*/
BEGIN_FTR_SECTION
/*
* Note - SDR1 is dropped in Power ISA v3. Hence not restoring
* SDR1 here
*/
mfspr r3,SPRN_PTCR
std r3,_PTCR(r1)
mfspr r3,SPRN_LPCR
std r3,_LPCR(r1)
FTR_SECTION_ELSE
mfspr r3,SPRN_SDR1
std r3,_SDR1(r1)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
mfspr r3,SPRN_RPR
std r3,_RPR(r1)
mfspr r3,SPRN_SPURR
std r3,_SPURR(r1)
mfspr r3,SPRN_PURR
std r3,_PURR(r1)
mfspr r3,SPRN_TSCR
std r3,_TSCR(r1)
mfspr r3,SPRN_DSCR
std r3,_DSCR(r1)
mfspr r3,SPRN_AMOR
std r3,_AMOR(r1)
mfspr r3,SPRN_WORT
std r3,_WORT(r1)
mfspr r3,SPRN_WORC
std r3,_WORC(r1)
/*
* On POWER9, there are idle states such as stop4, invoked via cpuidle,
* that lose hypervisor resources. In such cases, we need to save
* additional SPRs before entering those idle states so that they can
* be restored to their older values on wakeup from the idle state.
*
* On POWER8, the only such deep idle state is winkle which is used
* only in the context of CPU-Hotplug, where these additional SPRs are
* reinitiazed to a sane value. Hence there is no need to save/restore
* these SPRs.
*/
BEGIN_FTR_SECTION
blr
END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
power9_save_additional_sprs:
mfspr r3, SPRN_PID
mfspr r4, SPRN_LDBAR
std r3, STOP_PID(r13)
std r4, STOP_LDBAR(r13)
mfspr r3, SPRN_FSCR
mfspr r4, SPRN_HFSCR
std r3, STOP_FSCR(r13)
std r4, STOP_HFSCR(r13)
mfspr r3, SPRN_MMCRA
mfspr r4, SPRN_MMCR0
std r3, STOP_MMCRA(r13)
std r4, _MMCR0(r1)
mfspr r3, SPRN_MMCR1
mfspr r4, SPRN_MMCR2
std r3, STOP_MMCR1(r13)
std r4, STOP_MMCR2(r13)
blr
power9_restore_additional_sprs:
ld r3,_LPCR(r1)
ld r4, STOP_PID(r13)
mtspr SPRN_LPCR,r3
mtspr SPRN_PID, r4
ld r3, STOP_LDBAR(r13)
ld r4, STOP_FSCR(r13)
mtspr SPRN_LDBAR, r3
mtspr SPRN_FSCR, r4
ld r3, STOP_HFSCR(r13)
ld r4, STOP_MMCRA(r13)
mtspr SPRN_HFSCR, r3
mtspr SPRN_MMCRA, r4
ld r3, _MMCR0(r1)
ld r4, STOP_MMCR1(r13)
mtspr SPRN_MMCR0, r3
mtspr SPRN_MMCR1, r4
ld r3, STOP_MMCR2(r13)
mtspr SPRN_MMCR2, r3
blr
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
/*
* Used by threads when the lock bit of core_idle_state is set.
* Threads will spin in HMT_LOW until the lock bit is cleared.
* r14 - pointer to core_idle_state
* r15 - used to load contents of core_idle_state
* r9 - used as a temporary variable
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
*/
core_idle_lock_held:
HMT_LOW
3: lwz r15,0(r14)
andis. r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
bne 3b
HMT_MEDIUM
lwarx r15,0,r14
andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bne- core_idle_lock_held
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
blr
/*
* Pass requested state in r3:
* r3 - PNV_THREAD_NAP/SLEEP/WINKLE in POWER8
* - Requested PSSCR value in POWER9
*
* Address of idle handler to branch to in realmode in r4
*/
pnv_powersave_common:
/* Use r3 to pass state nap/sleep/winkle */
/* NAP is a state loss, we create a regs frame on the
* stack, fill it up with the state we care about and
* stick a pointer to it in PACAR1. We really only
* need to save PC, some CR bits and the NV GPRs,
* but for now an interrupt frame will do.
*/
mtctr r4
mflr r0
std r0,16(r1)
stdu r1,-INT_FRAME_SIZE(r1)
std r0,_LINK(r1)
std r0,_NIP(r1)
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 15:27:59 +08:00
/* We haven't lost state ... yet */
li r0,0
stb r0,PACA_NAPSTATELOST(r13)
/* Continue saving state */
SAVE_GPR(2, r1)
SAVE_NVGPRS(r1)
mfcr r5
std r5,_CCR(r1)
std r1,PACAR1(r13)
BEGIN_FTR_SECTION
/*
* POWER9 does not require real mode to stop, and presently does not
* set hwthread_state for KVM (threads don't share MMU context), so
* we can remain in virtual mode for this.
*/
bctr
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
/*
* POWER8
* Go to real mode to do the nap, as required by the architecture.
* Also, we need to be in real mode before setting hwthread_state,
* because as soon as we do that, another thread can switch
* the MMU context to the guest.
*/
LOAD_REG_IMMEDIATE(r7, MSR_IDLE)
mtmsrd r7,0
bctr
/*
* This is the sequence required to execute idle instructions, as
* specified in ISA v2.07 (and earlier). MSR[IR] and MSR[DR] must be 0.
*/
#define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST) \
/* Magic NAP/SLEEP/WINKLE mode enter sequence */ \
std r0,0(r1); \
ptesync; \
ld r0,0(r1); \
236: cmpd cr0,r0,r0; \
bne 236b; \
IDLE_INST;
.globl pnv_enter_arch207_idle_mode
pnv_enter_arch207_idle_mode:
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
/* Tell KVM we're entering idle */
li r4,KVM_HWTHREAD_IN_IDLE
/******************************************************/
/* N O T E W E L L ! ! ! N O T E W E L L */
/* The following store to HSTATE_HWTHREAD_STATE(r13) */
/* MUST occur in real mode, i.e. with the MMU off, */
/* and the MMU must stay off until we clear this flag */
/* and test HSTATE_HWTHREAD_REQ(r13) in */
/* pnv_powersave_wakeup in this file. */
/* The reason is that another thread can switch the */
/* MMU to a guest context whenever this flag is set */
/* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on, */
/* that would potentially cause this thread to start */
/* executing instructions from guest memory in */
/* hypervisor mode, leading to a host crash or data */
/* corruption, or worse. */
/******************************************************/
stb r4,HSTATE_HWTHREAD_STATE(r13)
#endif
stb r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi cr3,r3,PNV_THREAD_SLEEP
bge cr3,2f
IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
2:
/* Sleep or winkle */
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
li r5,0
beq cr3,3f
lis r5,PNV_CORE_IDLE_WINKLE_COUNT@h
3:
lwarx_loop1:
lwarx r15,0,r14
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bnel- core_idle_lock_held
powerpc/powernv: Fix race in updating core_idle_state core_idle_state is maintained for each core. It uses 0-7 bits to track whether a thread in the core has entered fastsleep or winkle. 8th bit is used as a lock bit. The lock bit is set in these 2 scenarios- - The thread is first in subcore to wakeup from sleep/winkle. - If its the last thread in the core about to enter sleep/winkle While the lock bit is set, if any other thread in the core wakes up, it loops until the lock bit is cleared before proceeding in the wakeup path. This helps prevent race conditions w.r.t fastsleep workaround and prevents threads from switching to process context before core/subcore resources are restored. But, in the path to sleep/winkle entry, we currently don't check for lock-bit. This exposes us to following race when running with subcore on- First thread in the subcorea Another thread in the same waking up core entering sleep/winkle lwarx r15,0,r14 ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 [Code to restore subcore state] lwarx r15,0,r14 [clear thread bit] stwcx. r15,0,r14 andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) Here, after the thread entering sleep clears its thread bit in core_idle_state, the value is overwritten by the thread waking up. In such cases when the core enters fastsleep, code mistakes an idle thread as running. Because of this, the first thread waking up from fastsleep which is supposed to resync timebase skips it. So we can end up having a core with stale timebase value. This patch fixes the above race by looping on the lock bit even while entering the idle states. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus' Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-07 04:09:23 +08:00
add r15,r15,r5 /* Add if winkle */
andc r15,r15,r7 /* Clear thread bit */
andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS
/*
* If cr0 = 0, then current thread is the last thread of the core entering
* sleep. Last thread needs to execute the hardware bug workaround code if
* required by the platform.
* Make the workaround call unconditionally here. The below branch call is
* patched out when the idle states are discovered if the platform does not
* require it.
*/
.global pnv_fastsleep_workaround_at_entry
pnv_fastsleep_workaround_at_entry:
beq fastsleep_workaround_at_entry
stwcx. r15,0,r14
bne- lwarx_loop1
isync
common_enter: /* common code for all the threads entering sleep or winkle */
bgt cr3,enter_winkle
IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
fastsleep_workaround_at_entry:
oris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
stwcx. r15,0,r14
bne- lwarx_loop1
isync
/* Fast sleep workaround */
li r3,1
li r4,1
bl opal_config_cpu_idle_state
/* Unlock */
xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
lwsync
stw r15,0(r14)
b common_enter
enter_winkle:
bl save_sprs_to_stack
IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
/*
powernv: Pass PSSCR value and mask to power9_idle_stop The power9_idle_stop method currently takes only the requested stop level as a parameter and picks up the rest of the PSSCR bits from a hand-coded macro. This is not a very flexible design, especially when the firmware has the capability to communicate the psscr value and the mask associated with a particular stop state via device tree. This patch modifies the power9_idle_stop API to take as parameters the PSSCR value and the PSSCR mask corresponding to the stop state that needs to be set. These PSSCR value and mask are respectively obtained by parsing the "ibm,cpu-idle-state-psscr" and "ibm,cpu-idle-state-psscr-mask" fields from the device tree. In addition to this, the patch adds support for handling stop states for which ESL and EC bits in the PSSCR are zero. As per the architecture, a wakeup from these stop states resumes execution from the subsequent instruction as opposed to waking up at the System Vector. The older firmware sets only the Requested Level (RL) field in the psscr and psscr-mask exposed in the device tree. For older firmware where psscr-mask=0xf, this patch will set the default sane values that the set for for remaining PSSCR fields (i.e PSLL, MTL, ESL, EC, and TR). For the new firmware, the patch will validate that the invariants required by the ISA for the psscr values are maintained by the firmware. This skiboot patch that exports fully populated PSSCR values and the mask for all the stop states can be found here: https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html [Optimize the number of instructions before entering STOP with ESL=EC=0, validate the PSSCR values provided by the firimware maintains the invariants required as per the ISA suggested by Balbir Singh] Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 16:36:28 +08:00
* r3 - PSSCR value corresponding to the requested stop state.
*/
power_enter_stop:
/*
* Check if we are executing the lite variant with ESL=EC=0
*/
andis. r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
powernv: Pass PSSCR value and mask to power9_idle_stop The power9_idle_stop method currently takes only the requested stop level as a parameter and picks up the rest of the PSSCR bits from a hand-coded macro. This is not a very flexible design, especially when the firmware has the capability to communicate the psscr value and the mask associated with a particular stop state via device tree. This patch modifies the power9_idle_stop API to take as parameters the PSSCR value and the PSSCR mask corresponding to the stop state that needs to be set. These PSSCR value and mask are respectively obtained by parsing the "ibm,cpu-idle-state-psscr" and "ibm,cpu-idle-state-psscr-mask" fields from the device tree. In addition to this, the patch adds support for handling stop states for which ESL and EC bits in the PSSCR are zero. As per the architecture, a wakeup from these stop states resumes execution from the subsequent instruction as opposed to waking up at the System Vector. The older firmware sets only the Requested Level (RL) field in the psscr and psscr-mask exposed in the device tree. For older firmware where psscr-mask=0xf, this patch will set the default sane values that the set for for remaining PSSCR fields (i.e PSLL, MTL, ESL, EC, and TR). For the new firmware, the patch will validate that the invariants required by the ISA for the psscr values are maintained by the firmware. This skiboot patch that exports fully populated PSSCR values and the mask for all the stop states can be found here: https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html [Optimize the number of instructions before entering STOP with ESL=EC=0, validate the PSSCR values provided by the firimware maintains the invariants required as per the ISA suggested by Balbir Singh] Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 16:36:28 +08:00
clrldi r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
bne .Lhandle_esl_ec_set
PPC_STOP
li r3,0 /* Since we didn't lose state, return 0 */
std r3, PACA_REQ_PSSCR(r13)
/*
* pnv_wakeup_noloss() expects r12 to contain the SRR1 value so
* it can determine if the wakeup reason is an HMI in
* CHECK_HMI_INTERRUPT.
*
* However, when we wakeup with ESL=0, SRR1 will not contain the wakeup
* reason, so there is no point setting r12 to SRR1.
*
* Further, we clear r12 here, so that we don't accidentally enter the
* HMI in pnv_wakeup_noloss() if the value of r12[42:45] == WAKE_HMI.
*/
li r12, 0
b pnv_wakeup_noloss
.Lhandle_esl_ec_set:
BEGIN_FTR_SECTION
/*
powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature Recently we added a CPU feature for Power9 DD2.0, to capture the fact that some workarounds are required only on Power9 DD1 and DD2.0 but not DD2.1 or later. Then in commit 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") and commit e3646330cf66 "powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") we changed CPU_FTR_SECTIONs to check for DD1 or DD20, eg: BEGIN_FTR_SECTION PPC_INVALIDATE_ERAT END_FTR_SECTION_IFSET(CPU_FTR_POWER9_DD1 | CPU_FTR_POWER9_DD20) Unfortunately although this reads as "if set DD1 or DD2.0", the or is a bitwise or and actually generates a mask of both bits. The code that does the feature patching then checks that the value of the CPU features masked with that mask are equal to the mask. So the end result is we're checking for DD1 and DD20 being set, which never happens. Yes the API is terrible. Removing the ERAT workaround on DD2.0 results in random SEGVs, the system tends to boot, but things randomly die including sometimes dhclient, udev etc. To fix the problem and hopefully avoid it in future, we remove the DD2.0 CPU feature and instead add a DD2.1 (or later) feature. This allows us to easily express that the workarounds are required if DD2.1 is not set. At some point we will drop the DD1 workarounds entirely and some of this can be cleaned up. Fixes: 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") Fixes: e3646330cf66 ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-15 11:25:42 +08:00
* POWER9 DD2.0 or earlier can incorrectly set PMAO when waking up after
* a state-loss idle. Saving and restoring MMCR0 over idle is a
* workaround.
*/
mfspr r4,SPRN_MMCR0
std r4,_MMCR0(r1)
powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature Recently we added a CPU feature for Power9 DD2.0, to capture the fact that some workarounds are required only on Power9 DD1 and DD2.0 but not DD2.1 or later. Then in commit 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") and commit e3646330cf66 "powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") we changed CPU_FTR_SECTIONs to check for DD1 or DD20, eg: BEGIN_FTR_SECTION PPC_INVALIDATE_ERAT END_FTR_SECTION_IFSET(CPU_FTR_POWER9_DD1 | CPU_FTR_POWER9_DD20) Unfortunately although this reads as "if set DD1 or DD2.0", the or is a bitwise or and actually generates a mask of both bits. The code that does the feature patching then checks that the value of the CPU features masked with that mask are equal to the mask. So the end result is we're checking for DD1 and DD20 being set, which never happens. Yes the API is terrible. Removing the ERAT workaround on DD2.0 results in random SEGVs, the system tends to boot, but things randomly die including sometimes dhclient, udev etc. To fix the problem and hopefully avoid it in future, we remove the DD2.0 CPU feature and instead add a DD2.1 (or later) feature. This allows us to easily express that the workarounds are required if DD2.1 is not set. At some point we will drop the DD1 workarounds entirely and some of this can be cleaned up. Fixes: 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") Fixes: e3646330cf66 ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-15 11:25:42 +08:00
END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
/*
* Check if the requested state is a deep idle state.
*/
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
cmpd r3,r4
bge .Lhandle_deep_stop
PPC_STOP /* Does not return (system reset interrupt) */
.Lhandle_deep_stop:
/*
* Entering deep idle state.
* Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
* stack and enter stop
*/
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop_stop:
lwarx r15,0,r14
andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bnel- core_idle_lock_held
andc r15,r15,r7 /* Clear thread bit */
stwcx. r15,0,r14
bne- lwarx_loop_stop
isync
bl save_sprs_to_stack
PPC_STOP /* Does not return (system reset interrupt) */
/*
* Entered with MSR[EE]=0 and no soft-masked interrupts pending.
* r3 contains desired idle state (PNV_THREAD_NAP/SLEEP/WINKLE).
*/
_GLOBAL(power7_idle_insn)
/* Now check if user or arch enabled NAP mode */
LOAD_REG_ADDR(r4, pnv_enter_arch207_idle_mode)
b pnv_powersave_common
#define CHECK_HMI_INTERRUPT \
BEGIN_FTR_SECTION_NESTED(66); \
rlwinm r0,r12,45-31,0xf; /* extract wake reason field (P8) */ \
FTR_SECTION_ELSE_NESTED(66); \
rlwinm r0,r12,45-31,0xe; /* P7 wake reason field is 3 bits */ \
ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
cmpwi r0,0xa; /* Hypervisor maintenance ? */ \
bne+ 20f; \
/* Invoke opal call to handle hmi */ \
ld r2,PACATOC(r13); \
ld r1,PACAR1(r13); \
std r3,ORIG_GPR3(r1); /* Save original r3 */ \
KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt When a guest is assigned to a core it converts the host Timebase (TB) into guest TB by adding guest timebase offset before entering into guest. During guest exit it restores the guest TB to host TB. This means under certain conditions (Guest migration) host TB and guest TB can differ. When we get an HMI for TB related issues the opal HMI handler would try fixing errors and restore the correct host TB value. With no guest running, we don't have any issues. But with guest running on the core we run into TB corruption issues. If we get an HMI while in the guest, the current HMI handler invokes opal hmi handler before forcing guest to exit. The guest exit path subtracts the guest TB offset from the current TB value which may have already been restored with host value by opal hmi handler. This leads to incorrect host and guest TB values. With split-core, things become more complex. With split-core, TB also gets split and each subcore gets its own TB register. When a hmi handler fixes a TB error and restores the TB value, it affects all the TB values of sibling subcores on the same core. On TB errors all the thread in the core gets HMI. With existing code, the individual threads call opal hmi handle independently which can easily throw TB out of sync if we have guest running on subcores. Hence we will need to co-ordinate with all the threads before making opal hmi handler call followed by TB resync. This patch introduces a sibling subcore state structure (shared by all threads in the core) in paca which holds information about whether sibling subcores are in Guest mode or host mode. An array in_guest[] of size MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore. The subcore id is used as index into in_guest[] array. Only primary thread entering/exiting the guest is responsible to set/unset its designated array element. On TB error, we get HMI interrupt on every thread on the core. Upon HMI, this patch will now force guest to vacate the core/subcore. Primary thread from each subcore will then turn off its respective bit from the above bitmap during the guest exit path just after the guest->host partition switch is complete. All other threads that have just exited the guest OR were already in host will wait until all other subcores clears their respective bit. Once all the subcores turn off their respective bit, all threads will will make call to opal hmi handler. It is not necessary that opal hmi handler would resync the TB value for every HMI interrupts. It would do so only for the HMI caused due to TB errors. For rest, it would not touch TB value. Hence to make things simpler, primary thread would call TB resync explicitly once for each core immediately after opal hmi handler instead of subtracting guest offset from TB. TB resync call will restore the TB with host value. Thus we can be sure about the TB state. One of the primary threads exiting the guest will take up the responsibility of calling TB resync. It will use one of the top bits (bit 63) from subcore state flags bitmap to make the decision. The first primary thread (among the subcores) that is able to set the bit will have to call the TB resync. Rest all other threads will wait until TB resync is complete. Once TB resync is complete all threads will then proceed. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-05-15 12:14:26 +08:00
li r3,0; /* NULL argument */ \
bl hmi_exception_realmode; \
nop; \
ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
20: nop;
/*
* Entered with MSR[EE]=0 and no soft-masked interrupts pending.
* r3 contains desired PSSCR register value.
*
* Offline (CPU unplug) case also must notify KVM that the CPU is
* idle.
*/
_GLOBAL(power9_offline_stop)
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
/*
* Tell KVM we're entering idle.
* This does not have to be done in real mode because the P9 MMU
* is independent per-thread. Some steppings share radix/hash mode
* between threads, but in that case KVM has a barrier sync in real
* mode before and after switching between radix and hash.
*/
li r4,KVM_HWTHREAD_IN_IDLE
stb r4,HSTATE_HWTHREAD_STATE(r13)
#endif
/* fall through */
_GLOBAL(power9_idle_stop)
std r3, PACA_REQ_PSSCR(r13)
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
BEGIN_FTR_SECTION
powerpc/powernv: Provide a way to force a core into SMT4 mode POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 18:32:00 +08:00
sync
lwz r5, PACA_DONT_STOP(r13)
cmpwi r5, 0
bne 1f
powerpc/powernv: Provide a way to force a core into SMT4 mode POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 18:32:00 +08:00
END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
#endif
mtspr SPRN_PSSCR,r3
LOAD_REG_ADDR(r4,power_enter_stop)
b pnv_powersave_common
/* No return */
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
1:
powerpc/powernv: Provide a way to force a core into SMT4 mode POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 18:32:00 +08:00
/*
* We get here when TM / thread reconfiguration bug workaround
* code wants to get the CPU into SMT4 mode, and therefore
* we are being asked not to stop.
*/
li r3, 0
std r3, PACA_REQ_PSSCR(r13)
blr /* return 0 for wakeup cause / SRR1 value */
#endif
/*
* On waking up from stop 0,1,2 with ESL=1 on POWER9 DD1,
* HSPRG0 will be set to the HSPRG0 value of one of the
* threads in this core. Thus the value we have in r13
* may not be this thread's paca pointer.
*
* Fortunately, the TIR remains invariant. Since this thread's
* paca pointer is recorded in all its sibling's paca, we can
* correctly recover this thread's paca pointer if we
* know the index of this thread in the core.
*
* This index can be obtained from the TIR.
*
* i.e, thread's position in the core = TIR.
* If this value is i, then this thread's paca is
* paca->thread_sibling_pacas[i].
*/
power9_dd1_recover_paca:
mfspr r4, SPRN_TIR
/*
* Since each entry in thread_sibling_pacas is 8 bytes
* we need to left-shift by 3 bits. Thus r4 = i * 8
*/
sldi r4, r4, 3
/* Get &paca->thread_sibling_pacas[0] in r5 */
ld r5, PACA_SIBLING_PACA_PTRS(r13)
/* Load paca->thread_sibling_pacas[i] into r13 */
ldx r13, r4, r5
SET_PACA(r13)
/*
* Indicate that we have lost NVGPR state
* which needs to be restored from the stack.
*/
li r3, 1
stb r3,PACA_NAPSTATELOST(r13)
blr
2017-04-19 21:05:47 +08:00
/*
* Called from machine check handler for powersave wakeups.
* Low level machine check processing has already been done. Now just
* go through the wake up path to get everything in order.
*
* r3 - The original SRR1 value.
* Original SRR[01] have been clobbered.
* MSR_RI is clear.
*/
.global pnv_powersave_wakeup_mce
pnv_powersave_wakeup_mce:
/* Set cr3 for pnv_powersave_wakeup */
rlwinm r11,r3,47-31,30,31
cmpwi cr3,r11,2
/*
* Now put the original SRR1 with SRR1_WAKEMCE_RESVD as the wake
* reason into r12, which allows reuse of the system reset wakeup
2017-04-19 21:05:47 +08:00
* code without being mistaken for another type of wakeup.
*/
oris r12,r3,SRR1_WAKEMCE_RESVD@h
2017-04-19 21:05:47 +08:00
b pnv_powersave_wakeup
/*
* Called from reset vector for powersave wakeups.
* cr3 - set to gt if waking up with partial/complete hypervisor state loss
* r12 - SRR1
*/
.global pnv_powersave_wakeup
pnv_powersave_wakeup:
ld r2, PACATOC(r13)
BEGIN_FTR_SECTION
BEGIN_FTR_SECTION_NESTED(70)
bl power9_dd1_recover_paca
END_FTR_SECTION_NESTED_IFSET(CPU_FTR_POWER9_DD1, 70)
bl pnv_restore_hyp_resource_arch300
FTR_SECTION_ELSE
bl pnv_restore_hyp_resource_arch207
ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
li r0,PNV_THREAD_RUNNING
stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */
mr r3,r12
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
lbz r0,HSTATE_HWTHREAD_STATE(r13)
cmpwi r0,KVM_HWTHREAD_IN_KERNEL
powerpc/kvm: Fix lockups when running KVM guests on Power8 When running KVM guests on Power8 we can see a lockup where one CPU stops responding. This often leads to a message such as: watchdog: CPU 136 detected hard LOCKUP on other CPUs 72 Task dump for CPU 72: qemu-system-ppc R running task 10560 20917 20908 0x00040004 And then backtraces on other CPUs, such as: Task dump for CPU 48: ksmd R running task 10032 1519 2 0x00000804 Call Trace: ... --- interrupt: 901 at smp_call_function_many+0x3c8/0x460 LR = smp_call_function_many+0x37c/0x460 pmdp_invalidate+0x100/0x1b0 __split_huge_pmd+0x52c/0xdb0 try_to_unmap_one+0x764/0x8b0 rmap_walk_anon+0x15c/0x370 try_to_unmap+0xb4/0x170 split_huge_page_to_list+0x148/0xa30 try_to_merge_one_page+0xc8/0x990 try_to_merge_with_ksm_page+0x74/0xf0 ksm_scan_thread+0x10ec/0x1ac0 kthread+0x160/0x1a0 ret_from_kernel_thread+0x5c/0x78 This is caused by commit 8c1c7fb0b5ec ("powerpc/64s/idle: avoid sync for KVM state when waking from idle"), which added a check in pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and test of kvm_hstate.hwthread_req. The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when entering the guest, so it can then come out to cede with KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after setting hwthread_req to 1, but because hwthread_state is still KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we wake up from idle and won't go to kvm_start_guest. From there the thread will return somewhere garbage and crash. Fix it by skipping the store of hwthread_state, but not the test of hwthread_req, when coming out of idle. It's OK to skip the sync in that case because hwthread_req will have been set on the same thread, so there is no synchronisation required. Fixes: 8c1c7fb0b5ec ("powerpc/64s/idle: avoid sync for KVM state when waking from idle") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-19 14:22:20 +08:00
beq 0f
li r0,KVM_HWTHREAD_IN_KERNEL
stb r0,HSTATE_HWTHREAD_STATE(r13)
/* Order setting hwthread_state vs. testing hwthread_req */
sync
powerpc/kvm: Fix lockups when running KVM guests on Power8 When running KVM guests on Power8 we can see a lockup where one CPU stops responding. This often leads to a message such as: watchdog: CPU 136 detected hard LOCKUP on other CPUs 72 Task dump for CPU 72: qemu-system-ppc R running task 10560 20917 20908 0x00040004 And then backtraces on other CPUs, such as: Task dump for CPU 48: ksmd R running task 10032 1519 2 0x00000804 Call Trace: ... --- interrupt: 901 at smp_call_function_many+0x3c8/0x460 LR = smp_call_function_many+0x37c/0x460 pmdp_invalidate+0x100/0x1b0 __split_huge_pmd+0x52c/0xdb0 try_to_unmap_one+0x764/0x8b0 rmap_walk_anon+0x15c/0x370 try_to_unmap+0xb4/0x170 split_huge_page_to_list+0x148/0xa30 try_to_merge_one_page+0xc8/0x990 try_to_merge_with_ksm_page+0x74/0xf0 ksm_scan_thread+0x10ec/0x1ac0 kthread+0x160/0x1a0 ret_from_kernel_thread+0x5c/0x78 This is caused by commit 8c1c7fb0b5ec ("powerpc/64s/idle: avoid sync for KVM state when waking from idle"), which added a check in pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and test of kvm_hstate.hwthread_req. The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when entering the guest, so it can then come out to cede with KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after setting hwthread_req to 1, but because hwthread_state is still KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we wake up from idle and won't go to kvm_start_guest. From there the thread will return somewhere garbage and crash. Fix it by skipping the store of hwthread_state, but not the test of hwthread_req, when coming out of idle. It's OK to skip the sync in that case because hwthread_req will have been set on the same thread, so there is no synchronisation required. Fixes: 8c1c7fb0b5ec ("powerpc/64s/idle: avoid sync for KVM state when waking from idle") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-19 14:22:20 +08:00
0: lbz r0,HSTATE_HWTHREAD_REQ(r13)
cmpwi r0,0
beq 1f
b kvm_start_guest
1:
#endif
/* Return SRR1 from power7_nap() */
blt cr3,pnv_wakeup_noloss
b pnv_wakeup_loss
/*
* Check whether we have woken up with hypervisor state loss.
* If yes, restore hypervisor state and return back to link.
*
* cr3 - set to gt if waking up with partial/complete hypervisor state loss
*/
pnv_restore_hyp_resource_arch300:
/*
* Workaround for POWER9, if we lost resources, the ERAT
* might have been mixed up and needs flushing. We also need
* to reload MMCR0 (see comment above). We also need to set
* then clear bit 60 in MMCRA to ensure the PMU starts running.
*/
blt cr3,1f
BEGIN_FTR_SECTION
PPC_INVALIDATE_ERAT
ld r1,PACAR1(r13)
ld r4,_MMCR0(r1)
mtspr SPRN_MMCR0,r4
powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature Recently we added a CPU feature for Power9 DD2.0, to capture the fact that some workarounds are required only on Power9 DD1 and DD2.0 but not DD2.1 or later. Then in commit 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") and commit e3646330cf66 "powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") we changed CPU_FTR_SECTIONs to check for DD1 or DD20, eg: BEGIN_FTR_SECTION PPC_INVALIDATE_ERAT END_FTR_SECTION_IFSET(CPU_FTR_POWER9_DD1 | CPU_FTR_POWER9_DD20) Unfortunately although this reads as "if set DD1 or DD2.0", the or is a bitwise or and actually generates a mask of both bits. The code that does the feature patching then checks that the value of the CPU features masked with that mask are equal to the mask. So the end result is we're checking for DD1 and DD20 being set, which never happens. Yes the API is terrible. Removing the ERAT workaround on DD2.0 results in random SEGVs, the system tends to boot, but things randomly die including sometimes dhclient, udev etc. To fix the problem and hopefully avoid it in future, we remove the DD2.0 CPU feature and instead add a DD2.1 (or later) feature. This allows us to easily express that the workarounds are required if DD2.1 is not set. At some point we will drop the DD1 workarounds entirely and some of this can be cleaned up. Fixes: 9d2f510a66ec ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1") Fixes: e3646330cf66 ("powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-15 11:25:42 +08:00
END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
mfspr r4,SPRN_MMCRA
ori r4,r4,(1 << (63-60))
mtspr SPRN_MMCRA,r4
xori r4,r4,(1 << (63-60))
mtspr SPRN_MMCRA,r4
1:
/*
* POWER ISA 3. Use PSSCR to determine if we
* are waking up from deep idle state
*/
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
BEGIN_FTR_SECTION_NESTED(71)
/*
* Assume that we are waking up from the state
* same as the Requested Level (RL) in the PSSCR
* which are Bits 60-63
*/
ld r5,PACA_REQ_PSSCR(r13)
rldicl r5,r5,0,60
FTR_SECTION_ELSE_NESTED(71)
/*
* 0-3 bits correspond to Power-Saving Level Status
* which indicates the idle state we are waking up from
*/
mfspr r5, SPRN_PSSCR
rldicl r5,r5,4,60
ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 71)
powerpc/powernv: Provide a way to force a core into SMT4 mode POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 18:32:00 +08:00
li r0, 0 /* clear requested_psscr to say we're awake */
std r0, PACA_REQ_PSSCR(r13)
cmpd cr4,r5,r4
bge cr4,pnv_wakeup_tb_loss /* returns to caller */
blr /* Waking up without hypervisor state loss. */
/* Same calling convention as arch300 */
pnv_restore_hyp_resource_arch207:
/*
* POWER ISA 2.07 or less.
* Check if we slept with sleep or winkle.
*/
lbz r4,PACA_THREAD_IDLE_STATE(r13)
cmpwi cr2,r4,PNV_THREAD_NAP
bgt cr2,pnv_wakeup_tb_loss /* Either sleep or Winkle */
/*
* We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
* up from nap. At this stage CR3 shouldn't contains 'gt' since that
* indicates we are waking with hypervisor state loss from nap.
*/
bgt cr3,.
blr /* Waking up without hypervisor state loss */
/*
* Called if waking up from idle state which can cause either partial or
* complete hyp state loss.
* In POWER8, called if waking up from fastsleep or winkle
* In POWER9, called if waking up from stop state >= pnv_first_deep_stop_state
*
* r13 - PACA
* cr3 - gt if waking up with partial/complete hypervisor state loss
*
* If ISA300:
* cr4 - gt or eq if waking up from complete hypervisor state loss.
*
* If ISA207:
* r4 - PACA_THREAD_IDLE_STATE
*/
pnv_wakeup_tb_loss:
ld r1,PACAR1(r13)
/*
* Before entering any idle state, the NVGPRs are saved in the stack.
* If there was a state loss, or PACA_NAPSTATELOST was set, then the
* NVGPRs are restored. If we are here, it is likely that state is lost,
* but not guaranteed -- neither ISA207 nor ISA300 tests to reach
* here are the same as the test to restore NVGPRS:
* PACA_THREAD_IDLE_STATE test for ISA207, PSSCR test for ISA300,
* and SRR1 test for restoring NVGPRs.
*
* We are about to clobber NVGPRs now, so set NAPSTATELOST to
* guarantee they will always be restored. This might be tightened
* with careful reading of specs (particularly for ISA300) but this
* is already a slow wakeup path and it's simpler to be safe.
*/
li r0,1
stb r0,PACA_NAPSTATELOST(r13)
/*
*
* Save SRR1 and LR in NVGPRs as they might be clobbered in
* opal_call() (called in CHECK_HMI_INTERRUPT). SRR1 is required
* to determine the wakeup reason if we branch to kvm_start_guest. LR
* is required to return back to reset vector after hypervisor state
* restore is complete.
*/
mr r19,r12
mr r18,r4
mflr r17
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lbz r7,PACA_THREAD_MASK(r13)
/*
* Take the core lock to synchronize against other threads.
*
* Lock bit is set in one of the 2 cases-
* a. In the sleep/winkle enter path, the last thread is executing
* fastsleep workaround code.
* b. In the wake up path, another thread is executing fastsleep
* workaround undo code or resyncing timebase or restoring context
* In either case loop until the lock bit is cleared.
*/
1:
lwarx r15,0,r14
andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bnel- core_idle_lock_held
oris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
stwcx. r15,0,r14
bne- 1b
isync
andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS
cmpwi cr2,r9,0
/*
* At this stage
* cr2 - eq if first thread to wakeup in core
* cr3- gt if waking up with partial/complete hypervisor state loss
* ISA300:
* cr4 - gt or eq if waking up from complete hypervisor state loss.
*/
BEGIN_FTR_SECTION
/*
* Were we in winkle?
* If yes, check if all threads were in winkle, decrement our
* winkle count, set all thread winkle bits if all were in winkle.
* Check if our thread has a winkle bit set, and set cr4 accordingly
* (to match ISA300, above). Pseudo-code for core idle state
* transitions for ISA207 is as follows (everything happens atomically
* due to store conditional and/or lock bit):
*
* nap_idle() { }
* nap_wake() { }
*
* sleep_idle()
* {
* core_idle_state &= ~thread_in_core
* }
*
* sleep_wake()
* {
* bool first_in_core, first_in_subcore;
*
* first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
* first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
*
* core_idle_state |= thread_in_core;
* }
*
* winkle_idle()
* {
* core_idle_state &= ~thread_in_core;
* core_idle_state += 1 << WINKLE_COUNT_SHIFT;
* }
*
* winkle_wake()
* {
* bool first_in_core, first_in_subcore, winkle_state_lost;
*
* first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
* first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
*
* core_idle_state |= thread_in_core;
*
* if ((core_idle_state & WINKLE_MASK) == (8 << WINKLE_COUNT_SIHFT))
* core_idle_state |= THREAD_WINKLE_BITS;
* core_idle_state -= 1 << WINKLE_COUNT_SHIFT;
*
* winkle_state_lost = core_idle_state &
* (thread_in_core << WINKLE_THREAD_SHIFT);
* core_idle_state &= ~(thread_in_core << WINKLE_THREAD_SHIFT);
* }
*
*/
cmpwi r18,PNV_THREAD_WINKLE
bne 2f
andis. r9,r15,PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT@h
subis r15,r15,PNV_CORE_IDLE_WINKLE_COUNT@h
beq 2f
ori r15,r15,PNV_CORE_IDLE_THREAD_WINKLE_BITS /* all were winkle */
2:
/* Shift thread bit to winkle mask, then test if this thread is set,
* and remove it from the winkle bits */
slwi r8,r7,8
and r8,r8,r15
andc r15,r15,r8
cmpwi cr4,r8,1 /* cr4 will be gt if our bit is set, lt if not */
lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
and r4,r4,r15
cmpwi r4,0 /* Check if first in subcore */
or r15,r15,r7 /* Set thread bit */
beq first_thread_in_subcore
END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
or r15,r15,r7 /* Set thread bit */
beq cr2,first_thread_in_core
/* Not first thread in core or subcore to wake up */
b clear_lock
first_thread_in_subcore:
/*
* If waking up from sleep, subcore state is not lost. Hence
* skip subcore state restore
*/
blt cr4,subcore_state_restored
/* Restore per-subcore state */
ld r4,_SDR1(r1)
mtspr SPRN_SDR1,r4
ld r4,_RPR(r1)
mtspr SPRN_RPR,r4
ld r4,_AMOR(r1)
mtspr SPRN_AMOR,r4
subcore_state_restored:
/*
* Check if the thread is also the first thread in the core. If not,
* skip to clear_lock.
*/
bne cr2,clear_lock
first_thread_in_core:
/*
* First thread in the core waking up from any state which can cause
* partial or complete hypervisor state loss. It needs to
* call the fastsleep workaround code if the platform requires it.
* Call it unconditionally here. The below branch instruction will
* be patched out if the platform does not have fastsleep or does not
* require the workaround. Patching will be performed during the
* discovery of idle-states.
*/
.global pnv_fastsleep_workaround_at_exit
pnv_fastsleep_workaround_at_exit:
b fastsleep_workaround_at_exit
timebase_resync:
/*
* Use cr3 which indicates that we are waking up with atleast partial
* hypervisor state loss to determine if TIMEBASE RESYNC is needed.
*/
ble cr3,.Ltb_resynced
/* Time base re-sync */
bl opal_resync_timebase;
/*
* If waking up from sleep (POWER8), per core state
* is not lost, skip to clear_lock.
*/
.Ltb_resynced:
blt cr4,clear_lock
/*
* First thread in the core to wake up and its waking up with
* complete hypervisor state loss. Restore per core hypervisor
* state.
*/
BEGIN_FTR_SECTION
ld r4,_PTCR(r1)
mtspr SPRN_PTCR,r4
ld r4,_RPR(r1)
mtspr SPRN_RPR,r4
ld r4,_AMOR(r1)
mtspr SPRN_AMOR,r4
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
ld r4,_TSCR(r1)
mtspr SPRN_TSCR,r4
ld r4,_WORC(r1)
mtspr SPRN_WORC,r4
clear_lock:
xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
lwsync
stw r15,0(r14)
common_exit:
/*
* Common to all threads.
*
* If waking up from sleep, hypervisor state is not lost. Hence
* skip hypervisor state restore.
*/
blt cr4,hypervisor_state_restored
/* Waking up from winkle */
BEGIN_MMU_FTR_SECTION
b no_segments
END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
/* Restore SLB from PACA */
ld r8,PACA_SLBSHADOWPTR(r13)
.rept SLB_NUM_BOLTED
li r3, SLBSHADOW_SAVEAREA
LDX_BE r5, r8, r3
addi r3, r3, 8
LDX_BE r6, r8, r3
andis. r7,r5,SLB_ESID_V@h
beq 1f
slbmte r6,r5
1: addi r8,r8,16
.endr
no_segments:
/* Restore per thread state */
ld r4,_SPURR(r1)
mtspr SPRN_SPURR,r4
ld r4,_PURR(r1)
mtspr SPRN_PURR,r4
ld r4,_DSCR(r1)
mtspr SPRN_DSCR,r4
ld r4,_WORT(r1)
mtspr SPRN_WORT,r4
/* Call cur_cpu_spec->cpu_restore() */
LOAD_REG_ADDR(r4, cur_cpu_spec)
ld r4,0(r4)
ld r12,CPU_SPEC_RESTORE(r4)
#ifdef PPC64_ELF_ABI_v1
ld r12,0(r12)
#endif
mtctr r12
bctrl
/*
* On POWER9, we can come here on wakeup from a cpuidle stop state.
* Hence restore the additional SPRs to the saved value.
*
* On POWER8, we come here only on winkle. Since winkle is used
* only in the case of CPU-Hotplug, we don't need to restore
* the additional SPRs.
*/
BEGIN_FTR_SECTION
bl power9_restore_additional_sprs
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
hypervisor_state_restored:
mr r12,r19
mtlr r17
blr /* return to pnv_powersave_wakeup */
fastsleep_workaround_at_exit:
li r3,1
li r4,0
bl opal_config_cpu_idle_state
b timebase_resync
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
/*
* R3 here contains the value that will be returned to the caller
* of power7_nap.
* R12 contains SRR1 for CHECK_HMI_INTERRUPT.
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
*/
.global pnv_wakeup_loss
pnv_wakeup_loss:
ld r1,PACAR1(r13)
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
REST_NVGPRS(r1)
REST_GPR(2, r1)
ld r4,PACAKMSR(r13)
ld r5,_LINK(r1)
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
ld r6,_CCR(r1)
addi r1,r1,INT_FRAME_SIZE
mtlr r5
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
mtcr r6
mtmsrd r4
blr
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
/*
* R3 here contains the value that will be returned to the caller
* of power7_nap.
* R12 contains SRR1 for CHECK_HMI_INTERRUPT.
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 11:48:40 +08:00
*/
pnv_wakeup_noloss:
lbz r0,PACA_NAPSTATELOST(r13)
cmpwi r0,0
bne pnv_wakeup_loss
ld r1,PACAR1(r13)
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
ld r4,PACAKMSR(r13)
ld r5,_NIP(r1)
ld r6,_CCR(r1)
addi r1,r1,INT_FRAME_SIZE
mtlr r5
powerpc/powernv: Restore non-volatile CRs after nap Patches 7cba160ad "powernv/cpuidle: Redesign idle states management" and 77b54e9f2 "powernv/powerpc: Add winkle support for offline cpus" use non-volatile condition registers (cr2, cr3 and cr4) early in the system reset interrupt handler (system_reset_pSeries()) before it has been determined if state loss has occurred. If state loss has not occurred, control returns via the power7_wakeup_noloss() path which does not restore those condition registers, leaving them corrupted. Fix this by restoring the condition registers in the power7_wakeup_noloss() case. This is apparent when running a KVM guest on hardware that does not support winkle or sleep and the guest makes use of secondary threads. In practice this means Power7 machines, though some early unreleased Power8 machines may also be susceptible. The secondary CPUs are taken off line before the guest is started and they call pnv_smp_cpu_kill_self(). This checks support for sleep states (in this case there is no support) and power7_nap() is called. When the CPU is woken, power7_nap() returns and because the CPU is still off line, the main while loop executes again. The sleep states support test is executed again, but because the tested values cannot have changed, the compiler has optimized the test away and instead we rely on the result of the first test, which has been left in cr3 and/or cr4. With the result overwritten, the wrong branch is taken and power7_winkle() is called on a CPU that does not support it, leading to it stalling. Fixes: 7cba160ad789 ("powernv/cpuidle: Redesign idle states management") Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus") [mpe: Massage change log a bit more] Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-01 14:50:34 +08:00
mtcr r6
mtmsrd r4
blr