2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-19 11:04:00 +08:00
linux-next/arch
Daniel Axtens f2dd80ecca powerpc/powernv: Panic on unhandled Machine Check
All unrecovered machine check errors on PowerNV should cause an
immediate panic. There are 2 reasons that this is the right policy:
it's not safe to continue, and we're already trying to reboot.

Firstly, if we go through the recovery process and do not successfully
recover, we can't be sure about the state of the machine, and it is
not safe to recover and proceed.

Linux knows about the following sources of Machine Check Errors:
- Uncorrectable Errors (UE)
- Effective - Real Address Translation (ERAT)
- Segment Lookaside Buffer (SLB)
- Translation Lookaside Buffer (TLB)
- Unknown/Unrecognised

In the SLB, TLB and ERAT cases, we can further categorise these as
parity errors, multihit errors or unknown/unrecognised.

We can handle SLB errors by flushing and reloading the SLB. We can
handle TLB and ERAT multihit errors by flushing the TLB. (It appears
we may not handle TLB and ERAT parity errors: I will investigate
further and send a followup patch if appropriate.)

This leaves us with uncorrectable errors. Uncorrectable errors are
usually the result of ECC memory detecting an error that it cannot
correct, but they also crop up in the context of PCI cards failing
during DMA writes, and during CAPI error events.

There are several types of UE, and there are 3 places a UE can occur:
Skiboot, the kernel, and userspace. For Skiboot errors, we have the
facility to make some recoverable. For userspace, we can simply kill
(SIGBUS) the affected process. We have no meaningful way to deal with
UEs in kernel space or in unrecoverable sections of Skiboot.

Currently, these unrecovered UEs fall through to
machine_check_expection() in traps.c, which calls die(), which OOPSes
and sends SIGBUS to the process. This sometimes allows us to stumble
onwards. For example we've seen UEs kill the kernel eehd and
khugepaged. However, the process killed could have held a lock, or it
could have been a more important process, etc: we can no longer make
any assertions about the state of the machine. Similarly if we see a
UE in skiboot (and again we've seen this happen), we're not in a
position where we can make any assertions about the state of the
machine.

Likewise, for unknown or unrecognised errors, we're not able to say
anything about the state of the machine.

Therefore, if we have an unrecovered MCE, the most appropriate thing
to do is to panic.

The second reason is that since e784b6499d ("powerpc/powernv: Invoke
opal_cec_reboot2() on unrecoverable machine check errors."), we
attempt a special OPAL reboot on an unhandled MCE. This is so the
hardware can record error data for later debugging.

The comments in that commit assert that we are heading down the panic
path anyway. At the moment this is not always true. With UEs in kernel
space, for instance, they are marked as recoverable by the hardware,
so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't
panic() but would simply die() and OOPS. It doesn't make sense to be
staggering on if we've just tried to reboot: we should panic().

Explicitly panic() on unrecovered MCEs on PowerNV.
Update the comments appropriately.

This fixes some hangs following EEH events on cxlflash setups.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-09 08:07:19 +11:00
..
alpha PCI updates for v4.3: 2015-09-25 11:16:53 -07:00
arc genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
arm Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm 2015-09-27 06:48:48 -04:00
arm64 AMD fixes for bugs introduced in the 4.2 merge window, 2015-09-25 10:51:40 -07:00
avr32 genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
blackfin genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
c6x genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
cris CRISv10: delete unused lib/dmacopy.c 2015-09-05 00:56:51 +02:00
frv PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code" 2015-09-15 13:18:04 -05:00
h8300 dma-mapping: consolidate dma_set_mask 2015-09-10 13:29:01 -07:00
hexagon Merge branch 'akpm' (patches from Andrew) 2015-09-10 18:19:42 -07:00
ia64 PCI updates for v4.3: 2015-09-25 11:16:53 -07:00
m32r lib/decompressors: use real out buf size for gunzip with kernel 2015-09-10 13:29:01 -07:00
m68k genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
metag genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
microblaze PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code" 2015-09-15 13:18:04 -05:00
mips PCI updates for v4.3: 2015-09-25 11:16:53 -07:00
mn10300 PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code" 2015-09-15 13:18:04 -05:00
nios2 nios2: add Max10 defconfig 2015-09-08 18:16:02 +08:00
openrisc dma-mapping: consolidate dma_set_mask 2015-09-10 13:29:01 -07:00
parisc parisc: Use platform_device_register_simple("rtc-generic") 2015-09-08 17:53:48 +02:00
powerpc powerpc/powernv: Panic on unhandled Machine Check 2015-10-09 08:07:19 +11:00
s390 AMD fixes for bugs introduced in the 4.2 merge window, 2015-09-25 10:51:40 -07:00
score Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-09-01 14:04:50 -07:00
sh genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
sparc genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
tile genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
um Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-09-01 14:04:50 -07:00
unicore32 genirq: Remove irq argument from irq flow handlers 2015-09-16 15:47:51 +02:00
x86 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-09-27 06:51:42 -04:00
xtensa PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code" 2015-09-15 13:18:04 -05:00
.gitignore
Kconfig kexec: split kexec_load syscall from kexec core code 2015-09-10 13:29:01 -07:00