Fam17h has new register offsets and fields for setting up the DRAM
scrubber so add support for this.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-17-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
MCA_STATUS[43] has been defined as "Poison" or "Reserved" for every bank
since Fam15h except for Fam15h, bank 4 in which case it's defined as
part of the McaStatSubCache bitfield.
Filter out that case.
Reported-by: Dean Liberty <Dean.Liberty@amd.com>
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479478222-19896-1-git-send-email-Yazen.Ghannam@amd.com
[ Split an almost unparseable ternary conditional, add a comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Fam17h has a different set of registers and bitfields. Most of these
registers are read through SMN (System Management Network) rather
than PCI config space. Also, the derivation of various values is now
different.
Update amd64_edac to read the appropriate registers and extract the
correct values for Fam17h.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-12-git-send-email-Yazen.Ghannam@amd.com
[ Save us the indentation level in read_mc_regs(), add defines ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Fam17h needs PCI device functions 0 and 6 instead of 1 and 2 as on older
systems. Update struct amd64_pvt to hold the new functions and reserve
them if on Fam17h.
Also, allocate an array of UMC structs within our newly allocated PVT
struct.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-11-git-send-email-Yazen.Ghannam@amd.com
[ init_one_instance() error handling, shorten lines, unbreak >80 cols lines. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Add a family type and associated ops for Fam17h. Define a struct to hold
all the UMC registers that we need. Make this a part of struct amd64_pvt
in order to maximize code reuse in the rest of the driver.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-10-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Update the ecc_enabled() function to work on Fam17h. This entails
reading a different set of registers and using the SMN (System
Management Network) rather than PCI devices.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-9-git-send-email-Yazen.Ghannam@amd.com
[ Fixup ecc_en assignment and get_umc_base(). ]
Signed-off-by: Borislav Petkov <bp@suse.de>
tip:ras/core contains the respective Fam17h x86 RAS bits which
amd64_edac is going to use. So merge it into the EDAC branch.
Signed-off-by: Borislav Petkov <bp@suse.de>
It's not recommended for the OS to try and force-enable ECC checking.
This is considered a firmware task since it includes memory training,
etc, so don't change ECC settings on Fam17h or newer systems and inform
the user.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1479850816-1595-1-git-send-email-Yazen.Ghannam@amd.com
[ Put the "forcing" message in an else branch. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Currently, deferred errors are classified as correctable in EDAC. Add a
new error type for deferred errors so that they are correctly reported
to the user.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-7-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
We only use __log_bus_error() to log DRAM ECC errors, so let's change
the name to reflect this. We'll also use this function for DRAM ECC
errors on Fam17h, but we'll call it from a different function than
decode_bus_error().
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-6-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
AMD Fam17h will not be using PCI function 2 for EDAC, but will continue
to use function 3. So let's get the name of F3 instead of F2 to support
Fam17h and previous families.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-5-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
nb_bus_decoder() is only used for DRAM ECC errors so rename it so that
the name is more generic and descriptive.
Also, call it for DRAM ECC errors on SMCA systems.
[ Boris: rename it to real function name with a verb in it. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1479423463-8536-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
If we execute the below steps without this patch:
modprobe mpc85xx_edac [The first insmod, everything is well.]
modprobe -r mpc85xx_edac
modprobe mpc85xx_edac [insmod again, error happens.]
We would get the error messages as below:
BUG: recent printk recursion!
Oops: Kernel access of bad area, sig: 11 [#48]
Modules linked in: mpc85xx_edac edac_core softdog [last unloaded: mpc85xx_edac]
CPU: 5 PID: 14773 Comm: modprobe Tainted: G D C 4.8.3-rt2
.vsnprintf
.vscnprintf
.vprintk_emit
.printk
.edac_pci_add_device
.mpc85xx_pci_err_probe
.platform_drv_probe
.driver_probe_device
.__driver_attach
.bus_for_each_dev
.driver_attach
.bus_add_driver
.driver_register
.__platform_register_drivers
.mpc85xx_mc_init
.do_one_initcall
.do_init_module
.load_module
.SyS_finit_module
system_call
Address this by cleaning up properly when removing the platform driver.
Tested on a T4240QDS board.
Signed-off-by: Yanjiang Jin <yanjiang.jin@windriver.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: york.sun@nxp.com
Link: http://lkml.kernel.org/r/1479351380-17109-2-git-send-email-yanjiang.jin@windriver.com
[ Boris: massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Trivial fix to spelling mistake "Mutilple" to "Multiple" in error
messages.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Loc Ho <lho@apm.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161114231104.5585-1-colin.king@canonical.com
Signed-off-by: Borislav Petkov <bp@suse.de>
When accessing the mc_devices list of memory controller descriptors, we
need to hold mem_ctls_mutex. This was not always the case, fix that.
Make all external callers call a version which grabs the mutex since the
last is local to edac_mc.c.
Reported-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Add accessor functions and hide the smca_names array. Also, add a
sanity-check to bank HWID assignment in get_smca_bank_info().
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make it differ more from struct smca_bank_name for better readability.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Call it simply smca_hwid and call local variables "hwid". More readable.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Disable IRQs while injecting SDRAM errors. The RT patches exposed
a spinlock deadlock where the spinlock taken for the regmap write
deadlocked with the IRQ clear regmap write.
Error injection is not normally enabled for ECC but only for testing.
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1476906827-9412-1-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Fix the following sparse warnings:
drivers/edac/skx_edac.c:266:25: warning:
symbol 'skx_cpuids' was not declared. Should it be static?
drivers/edac/skx_edac.c:1040:12: warning:
symbol 'skx_init' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1477147098-2842-1-git-send-email-weiyj.lk@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Add Knights Mill (KNM) to the list of CPU models supported by sb_edac.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161013153105.2517-6-piotr.luc@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
We now have symbolic names for a bunch of Intel CPU models via
asm/intel-family.h. The original conversion missed the EDAC drivers.
Convert them.
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160929204321.9FAE5F84@viggo.jf.intel.com
[ Remove comment, macro name is descriptive enough. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
buffers (Thor Thayer)
* Split the memory controller part out of mpc85xx and share it with a
* new Freescale ARM Layerscape driver (York Sun)
* amd64_edac fixes (Yazen Ghannam)
* Misc cleanups, refactoring and fixes all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX80EwAAoJEBLB8Bhh3lVKWwYP/1bbODQ7o+XhO8IaCDffYk30
8y4WdSnI0/QcP8JbSvFA7y6Zn4L0BbrbYhKLRDAg9c34V2bMaqonCnkDtT6YatUb
6l0H/hQ/Cah9AOm5PJLYg6O9s+ZBT8zA5b+F2Z9kUsuB6LSnVhp9skNrH6KPlm0U
4pFaLnHQenQPZbuRCfRxPU49ZuKtBZtQDkJLJlHXwn7e1qZy2Q4tMnnEtsY6U2ea
t3Hj+F8g+cdoiTQXOceCcOTR8GqDI6szgzn7vpXAGYvljBndszauAkxO7by79jg1
I8AQfgwoBF5CYL2Q0pzT1maHmmG2sydeRAHIvhmGxiEfFz1abWhriXbS33c32q8a
iFiVMAUIaSKpB/sB+5w5ymuBctI1mX5EQVW+8Xl2Gxt+olnhdJMocHnvQdYkfsYm
Ka8LcbaiK6ZQTbs/cIMOc2paE0AFPu5uXKHCPeZlhQAxOBvSPuDAv0+qUB/of5Uq
1SPidtsTmCI7X2hrdHAH9hLEkSjq68v3kqL5YnZL3H4gA3WohQEmX9ybjk097Kus
WWEhdi/PSFX0qQKotMUUDuxfNcKI6PZH9p+i2dN6tNCkiTDdb0Eo5lCXN7RVVhvq
qfE0Fcc4uDzh5MUS5jT58MWpA1cfdu9jbAf2BwFIU/poJcaeqy/SMyzCL+1D2/u6
dmDAtQbKUUwiltB8QzQd
=pcI8
-----END PGP SIGNATURE-----
Merge tag 'edac_for_4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"A lot of movement in the EDAC tree this time around, coarse summary
below:
- Altera Arria10 enablement of NAND, DMA, USB, QSPI and SD-MMC FIFO
buffers (Thor Thayer)
- split the memory controller part out of mpc85xx and share it with a
new Freescale ARM Layerscape driver (York Sun)
- amd64_edac fixes (Yazen Ghannam)
- misc cleanups, refactoring and fixes all over the place"
* tag 'edac_for_4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (37 commits)
EDAC, altera: Add IRQ Flags to disable IRQ while handling
EDAC, altera: Correct EDAC IRQ error message
EDAC, amd64: Autoload module using x86_cpu_id
EDAC, sb_edac: Remove NULL pointer check on array pci_tad
EDAC: Remove NO_IRQ from powerpc-only drivers
EDAC, fsl_ddr: Fix error return code in fsl_mc_err_probe()
EDAC, fsl_ddr: Add entry to MAINTAINERS
EDAC: Move Doug Thompson to CREDITS
EDAC, I3000: Orphan driver
EDAC, fsl_ddr: Replace simple_strtoul() with kstrtoul()
EDAC, layerscape: Add Layerscape EDAC support
EDAC, fsl_ddr: Fix IRQ dispose warning when module is removed
EDAC, fsl_ddr: Add support for little endian
EDAC, fsl_ddr: Add missing DDR DRAM types
EDAC, fsl_ddr: Rename macros and names
EDAC, fsl-ddr: Separate FSL DDR driver from MPC85xx
EDAC, mpc85xx: Replace printk() with pr_* format
EDAC, mpc85xx: Drop setting/clearing RFXE bit in HID1
EDAC, altera: Rename MC trigger to common name
EDAC, altera: Rename device trigger to common name
...
Add the IRQF_ONESHOT and IRQF_TRIGGER_HIGH flags to disable the IRQ
while executing the IRQ handler. Remove the IRQF_SHARED because these
are not shared IRQs in the domain. Exposed when flooding IRQs.
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1474582419-7053-2-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Bank 4 is reserved on family 0x17 and shouldn't generate any MCE
records. However, broken hardware and software is not something unheard
of so warn about bank 4 errors. They shouldn't be coming from bank 4
naturally but users can still use mce_amd_inj to simulate errors from it
for testing purposed.
Also, avoid special handling in the injector mce_amd_inj like it is
being done on the older families.
[ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The MCA_SYND and MCA_IPID registers contain valuable information and
should be included in MCE output. The MCA_SYND register contains
syndrome and other error information, and the MCA_IPID register will
uniquely identify the MCA bank's type without having to rely on system
software.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Scalable MCA defines a number of IP types. An MCA bank on an SMCA
system is defined as one of these IP types. A bank's type is uniquely
identified by the combination of the HWID and MCATYPE values read from
its MCA_IPID register.
Add the required tables in order to be able to lookup error descriptions
based on a bank's type and the error's extended error code.
[ bp: Align comments, simplify a bit. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The error descriptions defined for Fam17h can be reused for other SMCA
systems, so their names should reflect this.
Change f17h prefix to smca for error descriptions.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Print SyndV bit status and print the raw value of the MCA_SYND register.
Further decoding of the syndrome from struct mce.synd can be done in
other places where appropriate, e.g. DRAM ECC.
Boris: make the error stanza more compact by putting the error address
and syndrome on the same line:
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117
[Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000
[Hardware Error]: Invalid IP block specified.
[Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
pvt->pci_tad is a NUM_CHANNELS array of struct pci_dev pointers and
hence cannot be NULL, so the NULL pointer check on pci_tad is redundant.
Remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160908083801.14766-1-colin.king@canonical.com
Signed-off-by: Borislav Petkov <bp@suse.de>
We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
from powerpc-only drivers.
The pdata structs are kzalloc'ed, so we don't need to initialise those
to 0, we can just drop the assignments entirely.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Johannes Thumshirn <morbidrsa@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/1473674436-19467-1-git-send-email-mpe@ellerman.id.au
Signed-off-by: Borislav Petkov <bp@suse.de>
Return negative error code from the edac_mc_add_mc() error handling case
instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: York Sun <york.sun@nxp.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1473350284-26482-1-git-send-email-weiyj.lk@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The compatible DDR controllers may support DDR, DDR2, DDR3, DDR4 DRAM.
An individual controller doesn't support all of them. The EDAC driver
reads SDRAM_CFG to determine which mode is configured.
Add DDR4 and drop the defines used only in the mtype assignment.
Signed-off-by: York Sun <york.sun@nxp.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: morbidrsa@gmail.com
Cc: oss@buserror.net
Cc: stuart.yoder@nxp.com
Link: http://lkml.kernel.org/r/1470779760-16483-6-git-send-email-york.sun@nxp.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The mpc85xx-compatible DDR controllers are used on ARM-based SoCs too.
Carve out the DDR part from the mpc85xx EDAC driver in preparation to
support both architectures.
Signed-off-by: York Sun <york.sun@nxp.com>
Cc: Johannes Thumshirn <morbidrsa@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: oss@buserror.net
Cc: stuart.yoder@nxp.com
Link: http://lkml.kernel.org/r/1470946525-3410-1-git-send-email-york.sun@nxp.com
Signed-off-by: Borislav Petkov <bp@suse.de>
On e500v1, read fault exception enable (RFXE) controls whether assertion
of core_fault_in causes a machine check interrupt. Assertion of
core_fault_in can result from uncorrectable data error, such as an L2
multi-bit ECC error. It can also occur from a system error if logic on
the integrated device signals a fault for nonfatal errors. RFXE bit is
cleared out of reset, and should be left clear for normal operation.
Assertion of core_fault_in does not cause a machine check.
RFXE is set specifically for RIO (Rapid IO) and PCI for book E to catch
the errors by machine check. With this bit set, the EDAC driver can't
get the interrupt in case of uncorrectable error. So this bit is cleared
in favor of EDAC. However, the benefit of catching such uncorrectable
error doesn't outweigh the other errors which may hang the system.
Besides, e500v2 has different errors masked by RFXE, and e500mc doesn't
support this bit. It is more reasonable to leave RFXE as is in the EDAC
driver, and leave the uncorrectable errors triggering machine check for
e500v1.
Suggested-by: Scott Wood <oss@buserror.net>
Signed-off-by: York Sun <york.sun@nxp.com>
Cc: Johannes Thumshirn <morbidrsa@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: oss@buserror.net
Cc: stuart.yoder@nxp.com
Link: http://lkml.kernel.org/r/1470779760-16483-2-git-send-email-york.sun@nxp.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The L2 and OCRAM devices have different ecc trigger names than the other
EDAC devices (FIFO peripherals). Make them all the same and remove the
character array from the device structure.
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1471622666-15197-2-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
This is an entirely new driver instead of yet another set of patches
to sb_edac.c because:
1) Mapping from PCI devices to socket/memory controller is significantly
different. Skylake scatters devices on a socket across a number of
PCI buses.
2) There is an extra level of interleaving via the "mcroute" register
that would be a little messy to squeeze into the old driver.
3) Validation is getting too expensive. Changes to sb_edac need to
be checked against Sandy Bridge, Ivy Bridge, Haswell, Broadwell and
Knights Landing.
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to the reference manual of MPC8572 and T4240, bit 31 of
PEX_ERR_CAP_STAT is W1C (write 1 to clear).
Add the corresponding write to PEX_ERR_CAP_STAT in order to fix the PCIe
error capture.
Tested on a T4240 processor.
Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160815190849.29327-1-theidsieck@leenox.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Replace the deprecated create_singlethread_workqueue() with
alloc_ordered_workqueue() with WQ_MEM_RECLAIM. This is the identity
conversion.
It's not recommended to stall it from memory pressure. Hence,
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160813164124.GA9077@Karyakshetra
Signed-off-by: Borislav Petkov <bp@suse.de>
Fix the following sparse warning:
drivers/edac/altera_edac.c:1649:23: warning:
symbol 'a10_eccmgr_ic_ops' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Reviewed-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lkml <linux-kernel@vger.kernel.org>
Link: http://lkml.kernel.org/r/1470836667-11822-1-git-send-email-weiyj.lk@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Fam15hMod60h systems are using the channel decode of Fam15hMod30h which
gives incorrect results. Fam15hMod60h systems should use the generic
channel decode method plus a couple more cases.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1470236355-30039-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
On Intel Xeon Phi Knights Landing processor family the channels of the
memory controller have untypical arrangement - MC0 is mapped to CH3,4,5
and MC1 is mapped to CH0,1,2. This causes the EDAC driver to report the
channel name incorrectly.
We missed this change earlier, so the code already contains similar
comment, but the translation function is incorrect.
Without this patch:
errors in DIMM_A and DIMM_D were reported in DIMM_D
errors in DIMM_B and DIMM_E were reported in DIMM_E
errors in DIMM_C and DIMM_F were reported in DIMM_F
Correct this.
Hubert Chrzaniuk:
- rebased to 4.8
- comments and code cleanup
Fixes: d0cdf90031 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Cc: lukasz.odzioba@intel.com
Cc: mchehab@kernel.org
Cc: <stable@vger.kernel.org> # v4.5..
Link: http://lkml.kernel.org/r/1469231089-22837-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
[ Boris: Simplify a bit by removing char mc. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
* Minor cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXlbvGAAoJEBLB8Bhh3lVK9g0P/16Z5/WIqrrLjJegcZ3qKb4m
u2b5FBwbTDQyj/smZmfZfhZUleO7HDayL188fHPKO11my+Mvjyws8IZCvHshy7+3
b6Krno8NXZuNYtm32srSQQoLE4eN5pookOPMe2KJcRFghwnDqOo0ZfMZ8Lt6RVGg
YtqiVZ3DwaqoHUbvxkHvdJuMbtVEZIEC6SzK9t4URmdncAx+3/BTLqDSYc4LaL+B
tI6REgMy+QT1IUMImuUbOA5ktd78K+D26YKF9NuTSWNQvwzjSYy4y6WmDsjYpZF6
h6YWgGwERIy6gws5IdvkBMKX0s1RV7/sY6xFBsEQrX+IFywyY5OcI1gPRttD6VAn
ORJ4j7WNH8Phqld9YxjEX8yTwGxlxge88J/lb0i9s5jXPMX7ioxAqAjdpsJj6PlA
sNDP0fTyCECqME73cWaAS9bdrzB2dgOxEDfcWKWR24vZeHfnh72owD1k+J0iWu9c
0R2usPB1Yjbo6ODGpqG1R+u09LR+RpNe38oT5GcsFrj6/EaGz0kn2QulaNtylsDo
J5qkm02X1h3+F8/UHrzKavOGk82IQzLCbcW7eHPmyzkH8PTe1semLk9jHnXN5sK3
uL2nirZQVIF6QLbYxf0Uf2oSVs05xc7u9ZT+QOl6hqh/6/K7DWC45SoABwMXDaHH
jXr+y4vtr5IR8GQfZ/nd
=mdxx
-----END PGP SIGNATURE-----
Merge tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"This last cycle, Thor was busy adding Arria10 eth FIFO support to the
altera_edac driver along with other improvements. We have two
cleanups/fixes too.
Summary:
- Altera Arria10 ethernet FIFO buffer support (Thor Thayer)
- Minor cleanups"
* tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
ARM: dts: Add Arria10 Ethernet EDAC devicetree entry
EDAC, altera: Add Arria10 Ethernet EDAC support
EDAC, altera: Add Arria10 ECC memory init functions
Documentation: dt: socfpga: Add Arria10 Ethernet binding
EDAC, altera: Drop some ifdeffery
EDAC, altera: Add panic flag check to A10 IRQ
EDAC, altera: Check parent status for Arria10 EDAC block
EDAC, altera: Make all private data structures static
EDAC: Correct channel count limit
EDAC, amd64_edac: Init opstate at the proper time during init
EDAC, altera: Handle Arria10 SDRAM child node
EDAC, altera: Add ECC Manager IRQ controller support
Documentation: dt: socfpga: Add interrupt-controller to ecc-manager
In commit 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver
detection") I broke Knights Landing because I failed to notice that it
called a wrapper macro "sbridge_get_all_devices_knl" instead of
"sbridge_get_all_devices" like all the other types.
Now that we include the processor type in the pci_id_table structure we
can skip the wrappers and just have the sbridge_get_all_devices() check
the type to decide whether to allow duplicate devices and controllers to
have registers spread across buses.
Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for additional memory module ECCs, the IRQ function will
check a panic flag before doing a kernel panic on double bit errors.
OCRAM uncorrectable errors cause a panic because sleep/resume functions
and FPGA contents during sleep are stored in OCRAM.
ECCs on peripheral FIFO buffers will not cause a kernel panic on DBERRs
because the packet can be retried and therefore recovered.
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1466603939-7526-3-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
c44696fff0 ("EDAC: Remove arbitrary limit on number of channels")
lifted the arbitrary limit on memory controller channels in EDAC.
However, the dynamic channel attributes dynamic_csrow_dimm_attr and
dynamic_csrow_ce_count_attr remained 6.
This wasn't a problem except channels 6 and 7 weren't visible in sysfs
on machines with more than 6 channels after the conversion to static
attr groups with
2c1946b6d6 ("EDAC: Use static attribute groups for managing sysfs entries")
[ without that, we're exploding in edac_create_sysfs_mci_device()
because we're dereferencing out of the bounds of the
dynamic_csrow_dimm_attr array. ]
Add attributes for channels 6 and 7 along with a guard for the
future, should more channels be required and/or to sanity check for
misconfigured machines.
We still need to check against the number of channels present on the MC
first, as Thor reported.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reported-by: Hironobu Ishii <ishii.hironobu@jp.fujitsu.com>
Tested-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: <stable@vger.kernel.org> # 4.2
It is useless to do it if we're loaded on unsupported hardware so do
that only after we have detected at least 1 supported AMD northbridge.
Signed-off-by: Borislav Petkov <bp@suse.de>
Separate the device match arrays for each platform to prevent CycloneV
matches when calling of_platform_populate() on the Arria10 ECC manager
node.
If the SDRAM is a child node of ECC manager, call probe function via
of_platform_populate().
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1464193783-5071-4-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In commit
2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
we switched from using PCI ids to determine which platform we are
running on to using CPU model instead.
I forgot that Broadwell-DE has its own distinct model number different
from Broadwell-EP or -EX.
Fixing this isn't just adding a line to the array of cpuids - the
exising code assumed a 1:1 mapping between entries in that array and the
"enum type" values. Added the type to pci_id_table structure to remove
this dependency and allows two Broadwell cpu models.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Broadwell made a small change to the rank target register moving the
target rank ID field up from bits 16:19 to bits 20:23.
Also found that the offset field grew by one bit in the IVY_BRIDGE to
HASWELL transition, so fix the RIR_OFFSET() macro too.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org # v3.19+
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/2943fb819b1f7e396681165db9c12bb3df0e0b16.1464735623.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
* Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac. (Tony Luck)
* Do not register sb_edac with pci_register_driver(). (Tony Luck)
* Add support for Skylake to ie31200_edac. (Jason Baron)
* Do not register amd64_edac with pci_register_driver(). (Borislav Petkov)
+ the usual round of cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXOaHaAAoJEBLB8Bhh3lVKRdoP/jDewaX4GAxgcKaK9Kx7VjMe
i42E/7syh5iLoyFgZ1hUjvqiQHN4RyWloUYyNHbNxGS9uXCMXIUoJQd2FjOY442s
oeEaoMCfZsJLzKQti3fUPK9GugZ57BSZxfn6NlJPpyG4FxSirOcZFCxGaNuyqqrh
0M+gFvbWfZcVDLE0FI5CUYZWRxsk//+pLM4ZkQZFA8Eo1tGVc3r0zEEAlL/uEMDW
P0ATvRHNUeib9YQGTdSD7cNUpRX4SX+T8lDUiaVm+tHL5ES3b4rayMBdbVMrahfc
Gnxu9GtO5gWXEY6QDDpOx+VkbfFmDFodx833psJ3MVD8evEFHHdinkgDaptLrrV6
92ZDKR5s3W6tKXkcmGuExrtc17UgjLcRZCebXbv+5FJlVrslzvQ9ESOTRyiZBrCD
ZpFi2TZhpYU1uEWuoBZCbWNXW2pcSt7/bQ9bYUvfrvNfgPzPnblubJuVKRcfY0WB
x2vj0PNnckpoPRvskV7GEe0Y/JISzAxBQUK6XO+GJgMgz5M+23SEaSVU5yeyf26e
x/yD5yImQGN84AxfRMMvbR2JvpVLN3vdFWtWigneht160erVA/Qgw5Gcrpw53tZr
zPD3fGaetrNgS8BvJAy/c6xHV1j6A1RGCGThy8Ivqep9WD90a5JhR1NRjZp6m8sf
aB0A1zTP7e+hRFkZGahO
=ud2P
-----END PGP SIGNATURE-----
Merge tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"It was pretty busy in EDAC land this time:
- Altera Arria10 L2 cache and On-Chip RAM ECC handling (Thor Thayer)
- Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac
(Tony Luck)
- Do not register sb_edac with pci_register_driver() (Tony Luck)
- Add support for Skylake to ie31200_edac (Jason Baron)
- Do not register amd64_edac with pci_register_driver() (Borislav
Petkov)
... plus the usual round of cleanups and fixes all over the place"
* tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (25 commits)
EDAC, amd64_edac: Drop pci_register_driver() use
EDAC, ie31200_edac: Add Skylake support
EDAC, sb_edac: Use cpu family/model in driver detection
EDAC, i7core: Remove double buffering of error records
EDAC, amd64_edac: Issue driver banner only on success
ARM: socfpga: Initialize Arria10 OCRAM ECC on startup
EDAC: Increment correct counter in edac_inc_ue_error()
EDAC, sb_edac: Remove double buffering of error records
EDAC: Fix used after kfree() error in edac_unregister_sysfs()
EDAC, altera: Avoid unused function warnings
EDAC, altera: Remove useless casts
ARM: socfpga: Enable Arria10 OCRAM ECC on startup
EDAC, altera: Add Arria10 OCRAM ECC support
Documentation: dt: socfpga: Add Altera Arria10 OCRAM binding
EDAC, altera: Make OCRAM ECC dependency check generic
EDAC, altera: Add register offset for ECC Enable
EDAC, altera: Extract error inject operations to a struct fops
ARM: socfpga: Enable Arria10 L2 cache ECC on startup
EDAC, altera: Add Arria10 L2 Cache ECC handling
Documentation, dt, socfpga: Add Altera Arria10 L2 cache binding
...
Use X86_FEATURE_SMCA when detecting if SMCA is available instead of
directly using CPUID 0x80000007_EBX.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- remove homegrown instances counting.
- take F3 PCI device from amd_nb caching instead of F2 which was used with the
PCI core.
With those changes, the driver doesn't need to register a PCI driver and
relies on the northbridges caching which we do anyway on AMD.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Skylake adjusts some register locations, but otherwise follows the
existing model quite closely. I was able to verify that the 'ce_count'
increments when 'bad dimms' are used. The accounting of 'ce_count' and
'ue_count' is the primary functionality of interest for us. Tested on
Intel(R) Xeon(R) CPU E3-1260L v5 @ 2.90GHz.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1462547927-22679-1-git-send-email-jbaron@akamai.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Instead of picking a random PCI ID from the dozen or so we need to
access, just use x86_match_cpu() to pick based on CPU model number. The
choosing of PCI devices has been problematic in the past, see
11249e7399 ("sb_edac: Fix detection on SNB machines")
which fixed problems introduced by
d0585cd815 ("sb_edac: Claim a different PCI device").
This is especially ugly if future hardware might not even have
EDAC-relevant registers in PCI config space and we would still be
required to choose some "random" PCI devices to scan for just so our
driver loads.
Is this cleaner/clearer? It deletes much more code than it adds. Only
tested on Broadwell. The driver loads/unloads and loads again. Still
decodes errors too.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
In the bad old days the functions from x86_mce_decoder_chain could be
called in machine check context. So we used to carefully copy them and
defer processing until later. But in
f29a7aff4b ("x86/mce: Avoid potential deadlock due to printk() in MCE context")
we switched the logging code to save the record in a genpool, and call
the functions that registered to be notified later from a work queue.
So drop all the double buffering and do all the work we want to do as
soon as i7core_mce_check_error() is called.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/29ab2c370915c6e132fc5d88e7b72cb834bedbfe.1461855008.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Both of these drivers can return NOTIFY_BAD, but this terminates
processing other callbacks that were registered later on the chain.
Since the driver did nothing to log the error it seems wrong to prevent
other interested parties from seeing it. E.g. neither of them had even
bothered to check the type of the error to see if it was a memory error
before the return NOTIFY_BAD.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/72937355dd92318d2630979666063f8a2853495b.1461864507.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In the bad old days the functions from x86_mce_decoder_chain could be
called in machine check context. So we used to carefully copy them and
defer processing until later. But in
f29a7aff4b ("x86/mce: Avoid potential deadlock due to printk() in MCE context")
we switched the logging code to save the record in a genpool, and call
the functions that registered to be notified later from a work queue.
So drop all the double buffering and do all the work we want to do as
soon as sbridge_mce_check_error() is called.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: patrickg@supermicro.com
Link: http://lkml.kernel.org/r/100025611cd780d9bca72792b2b2146760da53e0.1460756761.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Code flow looks like this:
device_unregister(&mci->dev);
-> kobject_put+0x25/0x50
-> kobject_cleanup+0x77/0x190
-> device_release+0x32/0xa0
-> mci_attr_release+0x36/0x70
-> kfree(mci);
bus_unregister(mci->bus);
Fix is to grab a local copy of "mci->bus" and use that when we call
bus_unregister().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/21d595b0ab3d718d9cb206647f4ec91c05e62ec4.1461261078.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The recently added Arria10 OCRAM ECC support caused some new harmless
warnings about unused functions when it is disabled:
drivers/edac/altera_edac.c:1067:20: error: 'altr_edac_a10_ecc_irq' defined but not used [-Werror=unused-function]
drivers/edac/altera_edac.c:658:12: error: 'altr_check_ecc_deps' defined but not used [-Werror=unused-function]
This rearranges the code slightly to have those two functions inside
of the same #ifdef that hides their callers. It also manages to
avoid a forward declaration of the IRQ handler in the process.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: Alan Tull <atull@opensource.altera.com>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: c7b4be8db8 ("EDAC, altera: Add Arria10 OCRAM ECC support")
Link: http://lkml.kernel.org/r/1460837650-1237650-2-git-send-email-arnd@arndb.de
Signed-off-by: Borislav Petkov <bp@suse.de>
The altera EDAC driver refers to its per-device data
using a cast to '(void *)', which makes the pointer
non-const, though both the source and destination are
actually const.
Removing the annotation makes the reference (almost)
fit into a single line for improved readability, and
ensures that it is actually defined as const.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: Alan Tull <atull@opensource.altera.com>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1460837650-1237650-1-git-send-email-arnd@arndb.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Haswell and Broadwell can be configured to hash the channel
interleave function using bits [27:12] of the physical address.
On those processor models we must check to see if hashing is
enabled (bit21 of the HASWELL_HASYSDEFEATURE2 register) and
act accordingly.
Based on a patch by patrickg <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit:
eb1af3b71f ("Fix computation of channel address")
I switched the "sck_way" variable from holding the log2 value read
from the h/w to instead be the actual number. Unfortunately it
is needed in log2 form when used to shift the address.
Tested-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: eb1af3b71f ("Fix computation of channel address")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fix spelling typos found in printk
within various part of the kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull RAS updates from Ingo Molnar:
"Various RAS updates:
- AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
MCA) error decoding support (Aravind Gopalakrishnan)
- x86 memcpy_mcsafe() support, to enable smart(er) hardware error
recovery in NVDIMM drivers, based on an extension of the x86
exception handling code. (Tony Luck)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/sb_edac: Fix computation of channel address
x86/mm, x86/mce: Add memcpy_mcsafe()
x86/mce/AMD: Document some functionality
x86/mce: Clarify comments regarding deferred error
x86/mce/AMD: Fix logic to obtain block address
x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
x86/mce: Move MCx_CONFIG MSR definitions
x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
x86/mm: Expand the exception table logic to allow new handling options
x86/mce/AMD: Set MCAX Enable bit
x86/mce/AMD: Carve out threshold block preparation
x86/mce/AMD: Fix LVT offset configuration for thresholding
x86/mce/AMD: Reduce number of blocks scanned per bank
x86/mce/AMD: Do not perform shared bank check for future processors
x86/mce: Fix order of AMD MCE init function call
Large memory Haswell-EX systems with multiple DIMMs per channel were
sometimes reporting the wrong DIMM.
Found three problems:
1) Debug printouts for socket and channel interleave were not interpreting
the register fields correctly. The socket interleave field is a 2^X
value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2,
2=3. 3=4).
2) Actual use of the socket interleave value didn't interpret as 2^X
3) Conversion of address to channel address was complicated, and wrong.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For Scalable MCA enabled processors, errors are listed per IP block. And
since it is not required for an IP to map to a particular bank, we need
to use HWID and McaType values from the MCx_IPID register to figure out
which IP a given bank represents.
We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
context is corrupt.
Add logic here to decode errors from all known IP blocks for Fam17h
Model 00-0fh and to print TCC errors.
[ Minor fixups. ]
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Correct a typo introduced by
d0cdf90031 ("EDAC, sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
As a result under some configurations DIMMs were not correctly
recognized. Problem affects only Xeon Phi architecture.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1457361045-26221-1-git-send-email-hubert.chrzaniuk@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
debugfs_remove() is used to remove a file or a directory from the
debugfs filesystem on an EDAC device exit. However edac_debugfs might
not be empty. This is similar to
30f84a891b ("EDAC: Use edac_debugfs_remove_recursive()")
which changed the EDAC MCI code to use edac_debugfs_remove_recursive().
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455064165-3816-1-git-send-email-tthayer@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
We were getting this build warning:
drivers/edac/mpc85xx_edac.c:1247:6: warning: unused variable 'pvr'
pvr is only used if CONFIG_FSL_SOC_BOOKE is defined. Declare it
__maybe_unused.
Suggested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1454427573-7994-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
They're both running only when ->edac_check is initialized so remove
that check from the workqueue function itself. Synchronize/generalize
the ->op_state check between the two.
Kill useless comments, while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
We use the ->edac_check function pointers to determine whether we need
to setup a polling workqueue. However, the destroy path is not balanced
and we might try to teardown an unitialized workqueue.
Balance init and destroy paths by looking at ->edac_check in both cases.
Set op_state to OP_OFFLINE *before* destroying anything.
Reported-by: Zhiqiang Hou <Zhiqiang.Hou@freescale.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
dct_sel_base_off is declared as a u64 but we're only using the lower 32
bits because of a shift wrapping bug. This can possibly truncate the
upper 16 bits of DctSelBaseOffset[47:26], causing us to misdecode the CS
row.
Fixes: c8e518d567 ('amd64_edac: Sanitize f10_get_base_addr_offset')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160120095451.GB19898@mwanda
Signed-off-by: Borislav Petkov <bp@suse.de>
Knights Landing does not come with register that could be used to fetch
DIMM width. However the value is fixed for this architecture so it can
be hardcoded.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449840082-18673-1-git-send-email-hubert.chrzaniuk@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Hide the EDAC workqueue pointer in a separate compilation unit and add
accessors for the workqueue manipulations needed.
Remove edac_pci_reset_delay_period() which wasn't used by anything. It
seems it got added without a user with
91b99041c1 ("drivers/edac: updated PCI monitoring")
Signed-off-by: Borislav Petkov <bp@suse.de>
It cannot fail now. We either load EDAC core after having successfully
initialized edac_subsys or we don't.
Signed-off-by: Borislav Petkov <bp@suse.de>
This was really dumb - reference counting for the main EDAC sysfs
object. While we could've simply registered it as the first thing in the
module init path and then hand it around to what needs it.
Do that and rip out all the code around it, thus simplifying the whole
handling significantly.
Move the edac_subsys node back to edac_module.c.
Signed-off-by: Borislav Petkov <bp@suse.de>
EDAC workqueue destruction is really fragile. We cancel delayed work
but if it is still running and requeues itself, we still go ahead and
destroy the workqueue and the queued work explodes when workqueue core
attempts to run it.
Make the destruction more robust by switching op_state to offline so
that requeuing stops. Cancel any pending work *synchronously* too.
EDAC i7core: Driver loaded.
general protection fault: 0000 [#1] SMP
CPU 12
Modules linked in:
Supported: Yes
Pid: 0, comm: kworker/0:1 Tainted: G IE 3.0.101-0-default #1 HP ProLiant DL380 G7
RIP: 0010:[<ffffffff8107dcd7>] [<ffffffff8107dcd7>] __queue_work+0x17/0x3f0
< ... regs ...>
Process kworker/0:1 (pid: 0, threadinfo ffff88019def6000, task ffff88019def4600)
Stack:
...
Call Trace:
call_timer_fn
run_timer_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
intel_idle
cpuidle_idle_call
cpu_idle
Code: ...
RIP __queue_work
RSP <...>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Originally the mpc85xx-pci-edac driver bound directly to the PCI
controller node.
Commit
905e75c46d ("powerpc/fsl-pci: Unify pci/pcie initialization code")
turned the PCI controller code into a platform device. Since we can't
have two drivers binding to the same device, the EDAC code was changed
to be called into as a library-style submodule. However, this doesn't
work if the EDAC driver is built as a module.
Commit
8d8fcba6d1ea ("EDAC: Rip out the edac_subsys reference counting")
exposed another problem with this approach -- mpc85xx_pci_err_probe()
was being called in the same early boot phase that the PCI controller
is initialized, rather than in the device_initcall phase that the EDAC
layer expects. This caused a crash on boot.
To fix this, the PCI controller code now creates a child platform device
specifically for EDAC, which the mpc85xx-pci-edac driver binds to.
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Jia Hongtao <B38951@freescale.com>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Kim Phillips <kim.phillips@freescale.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Masanari Iida <standby24x7@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Link: http://lkml.kernel.org/r/1449774432-18593-1-git-send-email-scottwood@freescale.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Knights Landing is the next generation architecture for HPC market.
KNL introduces concept of a tile and CHA - Cache/Home Agent for memory
accesses.
Some things are fixed in KNL:
() There's single DIMM slot per channel
() There's 2 memory controllers with 3 channels each, however,
from EDAC standpoint, it is presented as single memory controller
with 6 channels. In order to represent 2 MCs w/ 3 CH, it would
require major redesign of EDAC core driver.
Basically, two functionalities are added/extended:
() during driver initialization KNL topology is being recognized, i.e.
which channels are populated with what DIMM sizes
(knl_get_dimm_capacity function)
() handle MCE errors - channel swizzling
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-5-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Add options to sbridge_get_all_devices() to allow for duplicate device
IDs and devices that are scattered across mulitple PCI buses.
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-4-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
SAD limit, interleave mode and DRAM related functionalities are now
virtualized, so that overriding them is easier.
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-3-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
These new helpers simplify implementing multi-driver modules and
properly handle failure to register one driver by unregistering all
previously registered drivers.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449073138-10852-2-git-send-email-thierry.reding@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
These new helpers simplify implementing multi-driver modules and
properly handle failure to register one driver by unregistering all
previously registered drivers.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1449136632-11680-1-git-send-email-thierry.reding@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The asm-generic changes for 4.4 are mostly a series from Christoph Hellwig
to clean up various abuses of headers in there. The patch to rename the
io-64-nonatomic-*.h headers caused some conflicts with new users, so I
added a workaround that we can remove in the next merge window.
The only other patch is a warning fix from Marek Vasut
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVjzaf2CrR//JCVInAQImmhAA20fZ91sUlnA5skKNPT1phhF6Z7UF2Sx5
nPKcHQD3HA3lT1OKfPBYvCo+loYflvXFLaQThVylVcnE/8ecAEMtft4nnGW2nXvh
sZqHIZ8fszTB53cynAZKTjdobD1wu33Rq7XRzg0ugn1mdxFkOzCHW/xDRvWRR5TL
rdQjzzgvn2PNlqFfHlh6cZ5ykShM36AIKs3WGA0H0Y/aYsE9GmDOAUp41q1mLXnA
4lKQaIxoeOa+kmlsUB0wEHUecWWWJH4GAP+CtdKzTX9v12bGNhmiKUMCETG78BT3
uL8irSqaViNwSAS9tBxSpqvmVUsa5aCA5M3MYiO+fH9ifd7wbR65g/wq39D3Pc01
KnZ3BNVRW5XSA3c86pr8vbg/HOynUXK8TN0lzt6rEk8bjoPBNEDy5YWzy0t6reVe
wX65F+ver8upjOKe9yl2Jsg+5Kcmy79GyYjLUY3TU2mZ+dIdScy/jIWatXe/OTKZ
iB4Ctc4MDe9GDECmlPOWf98AXqsBUuKQiWKCN/OPxLtFOeWBvi4IzvFuO8QvnL9p
jZcRDmIlIWAcDX/2wMnLjV+Hqi3EeReIrYznxTGnO7HHVInF555GP51vFaG5k+SN
smJQAB0/sostmC1OCCqBKq5b6/li95/No7+0v0SUhJJ5o76AR5CcNsnolXesw1fu
vTUkB/I66Hk=
=dQKG
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
"The asm-generic changes for 4.4 are mostly a series from Christoph
Hellwig to clean up various abuses of headers in there. The patch to
rename the io-64-nonatomic-*.h headers caused some conflicts with new
users, so I added a workaround that we can remove in the next merge
window.
The only other patch is a warning fix from Marek Vasut"
* tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: temporarily add back asm-generic/io-64-nonatomic*.h
asm-generic: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations
gpio-mxc: stop including <asm-generic/bug>
n_tracesink: stop including <asm-generic/bug>
n_tracerouter: stop including <asm-generic/bug>
mlx5: stop including <asm-generic/kmap_types.h>
hifn_795x: stop including <asm-generic/kmap_types.h>
drbd: stop including <asm-generic/kmap_types.h>
move count_zeroes.h out of asm-generic
move io-64-nonatomic*.h out of asm-generic
Pull RAS changes from Ingo Molnar:
"The main system reliability related changes were from x86, but also
some generic RAS changes:
- AMD MCE error injection subsystem enhancements. (Aravind
Gopalakrishnan)
- Fix MCE and CPU hotplug interaction bug. (Ashok Raj)
- kcrash bootup robustness fix. (Baoquan He)
- kcrash cleanups. (Borislav Petkov)
- x86 microcode driver rework: simplify it by unmodularizing it and
other cleanups. (Borislav Petkov)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
x86/mce: Add a Scalable MCA vendor flags bit
MAINTAINERS: Unify the microcode driver section
x86/microcode/intel: Move #ifdef DEBUG inside the function
x86/microcode/amd: Remove maintainers from comments
x86/microcode: Remove modularization leftovers
x86/microcode: Merge the early microcode loader
x86/microcode: Unmodularize the microcode driver
x86/mce: Fix thermal throttling reporting after kexec
kexec/crash: Say which char is the unrecognized
x86/setup/crash: Check memblock_reserve() retval
x86/setup/crash: Cleanup some more
x86/setup/crash: Remove alignment variable
x86/setup: Cleanup crashkernel reservation functions
x86/amd_nb, EDAC: Rename amd_get_node_id()
x86/setup: Do not reserve crashkernel high memory if low reservation failed
x86/microcode/amd: Do not overwrite final patch levels
x86/microcode/amd: Extract current patch level read to a function
x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
...
The PAGES_TO_MiB macro is used for unit conversion but the
trace_mc_event() tracepoint expects a page address. Fix that.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1445341538-24271-1-git-send-email-tanxiaojun@huawei.com
Signed-off-by: Borislav Petkov <bp@suse.de>
This function doesn't give us the "Node ID" as the function name
suggests. Rather, it receives a PCI device as argument, checks
the available F3 PCI device IDs in the system and returns the
index of the matching Bus/Device IDs.
Rename it to amd_pci_dev_to_node_id().
No functional change is introduced.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1445246268-26285-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The bootloader may or may not enable the ECC_CORR_EN bit. By
not enabling ECC_CORR_EN, when error happens, it is the user's
responsibility to perform a full SDRAM scrub.
Remove the check for ECC_CORR_EN.
Signed-off-by: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Thor Thayer <tthayer@opensource.altera.com>
Link: http://lkml.kernel.org/r/1444864456-21778-1-git-send-email-dinguyen@opensource.altera.com
Signed-off-by: Borislav Petkov <bp@suse.de>
These are not implementations of default architecture code but helpers
for drivers. Move them to the place they belong to.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
debugfs_remove() is used to remove a file or a directory from the
debugfs filesystem, but mci->debugfs might not empty.
This can be triggered by the following sequence:
1) Enable CONFIG_EDAC_DEBUG
2) insmod an EDAC module (like i3000_edac or similar)
3) rmmod this module
4) we can see files remaining under <debugfs_mountpoint>/edac/ like
"fake_inject", for example.
Removing edac_core then, causes a NULL pointer dereference.
Reported-by: Yun Wu (Abel) <wuyun.wu@huawei.com>
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1444787364-104353-1-git-send-email-tanxiaojun@huawei.com
Signed-off-by: Borislav Petkov <bp@suse.de>
This platform driver has an OF device ID table but the OF module alias
information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20150917114619.GA13145@goodgumbo.baconseed.org
Signed-off-by: Borislav Petkov <bp@suse.de>
Git provides us all the changelogs anyway. So trim the comments section
here. Update the copyrights info while at it.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1443440593-2316-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The scrub rate control register has moved to function 2 in PCI config
space and is at a different offset on family 0x15, models 0x60 and
later. The minimum recommended scrub rate has also changed. (Refer to
D18F2x1c9_dct[1:0][DramScrub] in Fam15hM60h BKDG).
Adjust set_scrub_rate() and get_scrub_rate() functions to accommodate
this.
Tested on F15hM60h, Fam15h, models 00h-0fh and Fam10h systems.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1443440593-2316-2-git-send-email-Aravind.Gopalakrishnan@amd.com
[ Cleanup conditionals. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Updating dimm_label to an empty string does not make much sense. Change
the sysfs dimm_label store operation to fail a request when an input
string is empty.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: elliott@hpe.com
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1443124767.25474.172.camel@hpe.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Sysfs "dimm_label" and "chX_dimm_label" nodes have the following issues
in their store operation:
1) A newline-terminated input string causes redundant newlines:
# echo "test" > /sys/bus/mc0/devices/dimm0/dimm_label
# cat /sys/bus/mc0/devices/dimm0/dimm_label
test
# od -bc /sys/bus/mc0/devices/dimm0/dimm_label
0000000 164 145 163 164 012 012
t e s t \n \n
0000006
2) The original label string (31 characters) cannot be stored due to
an improper size check:
# echo "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0" > /sys/bus/mc0/devices/dimm0/dimm_label
# cat /sys/bus/mc0/devices/dimm0/dimm_label
# od -bc /sys/bus/mc0/devices/dimm0/dimm_label
0000000 012 012
\n \n
0000002
3) An input string longer than the buffer size results a wrong label
info as it allows a retry with the remaining string:
# echo "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0_TEST" > /sys/bus/mc0/devices/dimm0/dimm_label
# cat /sys/bus/mc0/devices/dimm0/dimm_label
_TEST
Fix these issues by making the following changes:
1) Replace a newline character at the end by setting a null. It also
assures that the string is null-terminated in the label buffer.
2) Check the label buffer size with 'sizeof(dimm->label)'.
3) Fail a request if its string exceeds the label buffer size.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Robert Elliott <elliott@hpe.com>
Link: http://lkml.kernel.org/r/1443121564.25474.160.camel@hpe.com
Signed-off-by: Borislav Petkov <bp@suse.de>
After
7d375bffa5 ("sb_edac: Fix support for systems with two home agents per socket")
sysfs "dimm_label" and "chX_dimm_label" show their label string without a
newline "\n" at the end.
[root@orange ~]# cat /sys/bus/mc0/devices/dimm0/dimm_label
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0[root@orange ~]#
[root@orange ~]# cat /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0[root@orange ~]#
The label strings now have 31 characters, which are the same as
EDAC_MC_LABEL_LEN. Since the snprintf()s in channel_dimm_label_show()
and dimmdev_label_show() limit the whole length by EDAC_MC_LABEL_LEN,
the newline in the format "%s\n" is ignored.
[root@orange ~]# od -bc /sys/bus/mc0/devices/dimm0/dimm_label
0000000 103 120 125 137 123 162 143 111 104 043 060 137 110 141 043 060
C P U _ S r c I D # 0 _ H a # 0
0000020 137 103 150 141 156 043 060 137 104 111 115 115 043 060 000
_ C h a n # 0 _ D I M M # 0 \0
0000037
Fix it by using 'sizeof(dimm->label) + 1' as the whole length in the
snprintf()s in channel_dimm_label_show() and dimmdev_label_show().
Reported-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Link: http://lkml.kernel.org/r/1442933883-21587-2-git-send-email-toshi.kani@hpe.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In commit
7d375bffa5 ("sb_edac: Fix support for systems with two home agents per socket")
NUM_CHANNELS was changed to 8 and the channel space was renumerated to
handle EN, EP, and EX configurations.
The *_mci_bind_devs() functions - except for sbridge_mci_bind_devs() -
got a new device presence check in the form of saw_chan_mask. However,
sbridge_mci_bind_devs() still uses the NUM_CHANNELS for loop.
With the increase in NUM_CHANNELS, this loop fails at index 4 since
SB only has 4 TADs. This results in the following error on SB machines:
EDAC sbridge: Some needed devices are missing
EDAC sbridge: Couldn't find mci handler
EDAC sbridge: Couldn't find mci handle
This patch adapts the saw_chan_mask logic for sbridge_mci_bind_devs() as
well.
After this patch:
EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (POLLED)
EDAC MC1: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#1: DEV 0000:7f:0e.0 (POLLED)
Signed-off-by: Seth Jennings <sjenning@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.2
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1438798561-10180-1-git-send-email-sjenning@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
We already have edac_mem_types[] that enumerates the different kinds of
memory. So, use that and remove the redundant memory_type[] array here.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1442436811-23382-2-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Drop CONFIG_EDAC_DEBUG ifdeffery too, while at it.
Tested-by: Loc Ho <lho@apm.com>
Cc: linux-edac@vger.kernel.org
Signed-off-by: Borislav Petkov <bp@suse.de>
This driver creates its debugfs hierarchy under the toplevel debugfs dir
- see i5100_init() - so make it use edac_debugfs_create_dir_at( , NULL)
because we're not breaking userspace. Oh well.
Signed-off-by: Borislav Petkov <bp@suse.de>