Commit Graph

174 Commits

Author SHA1 Message Date
Muralidhara M K
9a5f580c1c EDAC/mc: Add support for HBM3 memory type
AMD MI300A models use HBM3 (High Bandwidth Memory Gen 3) memory. HBM is
a high-speed computer memory interface for 3D-stacked synchronous
dynamic random-access memory (SDRAM).

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231102114225.2006878-4-muralimk@amd.com
2023-11-28 16:39:12 +01:00
Serge Semin
93df194765 EDAC/mc: Drop duplicated dimm->nr_pages debug printout
The duplicated edac_dbg()-based dimm->nr_pages print was introduced in

  6e84d359b2 ("edac_mc: Cleanup per-dimm_info debug messages").

The duplicated line can be found even in the commit message text:

  [ 1011.380101] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000
  [ 1011.380103] EDAC DEBUG: edac_mc_dump_dimm:   dimm->grain = 8
  [ 1011.380104] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000

Drop the second edac_dbg() call.

  [ bp: Massage commit message. ]

Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220822190730.27277-14-Sergey.Semin@baikalelectronics.ru
2022-09-01 08:52:18 +02:00
Borislav Petkov
13088b65d9 EDAC: Use kcalloc()
It is syntactic sugar anyway:

  # drivers/edac/edac_mc.o:

   text    data     bss     dec     hex filename
  13378     324       8   13710    358e edac_mc.o.before
  13378     324       8   13710    358e edac_mc.o.after

md5:
   70a53ee3ac7f867730e35c2be9110748  edac_mc.o.before.asm
   70a53ee3ac7f867730e35c2be9110748  edac_mc.o.after.asm

  # drivers/edac/edac_device.o:

   text    data     bss     dec     hex filename
   5704     120       4    5828    16c4 edac_device.o.before
   5704     120       4    5828    16c4 edac_device.o.after

md5:
   880563c859da6eb9aca85ec431fdbaeb  edac_device.o.before.asm
   880563c859da6eb9aca85ec431fdbaeb  edac_device.o.after.asm

No functional changes.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220412211957.28899-1-bp@alien8.de
2022-05-02 11:32:44 +02:00
Borislav Petkov
713c4ff885 EDAC/mc: Get rid of edac_align_ptr()
Get rid of it now that it is unused.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220310095254.1510-6-bp@alien8.de
2022-04-11 12:04:11 +02:00
Borislav Petkov
0bbb265f70 EDAC/mc: Get rid of silly one-shot struct allocation in edac_mc_alloc()
This has probably meant something at some point but there's no need for
it anymore - the struct mem_ctl_info allocation can happen with normal,
boring k*alloc() calls like everyone else does it.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220310095254.1510-2-bp@alien8.de
2022-04-11 11:25:39 +02:00
Borislav Petkov
1422df58e5 Merge branch 'edac-amd64' into edac-updates-for-v5.18
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-03-21 10:34:57 +01:00
Eliav Farber
f8efca92ae EDAC: Fix calculation of returned address and next offset in edac_align_ptr()
Do alignment logic properly and use the "ptr" local variable for
calculating the remainder of the alignment.

This became an issue because struct edac_mc_layer has a size that is not
zero modulo eight, and the next offset that was prepared for the private
data was unaligned, causing an alignment exception.

The patch in Fixes: which broke this actually wanted to "what we
actually care about is the alignment of the actual pointer that's about
to be returned." But it didn't check that alignment.

Use the correct variable "ptr" for that.

  [ bp: Massage commit message. ]

Fixes: 8447c4d15e ("edac: Do alignment logic properly in edac_align_ptr()")
Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220113100622.12783-2-farbere@amazon.com
2022-02-15 15:54:46 +01:00
Eliav Farber
b0596da1a0 EDAC/mc: Remove unnecessary cast to char * in edac_align_ptr()
Remove the forgotten (char *) casts as that function returns void *.

  [ bp: Rewrite commit message. ]

Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220113100622.12783-3-farbere@amazon.com
2022-01-26 19:42:09 +01:00
Yazen Ghannam
f957112423 EDAC: Add RDDR5 and LRDDR5 memory types
Include Registered-DDR5 and Load-Reduced DDR5 in the list of memory
types.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208174356.1997855-2-yazen.ghannam@amd.com
2021-12-10 12:51:28 +01:00
Len Baker
fca6116564 EDAC/mc: Replace strcpy(), sprintf() and snprintf() with strscpy() or scnprintf()
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehavior. The safe replacement is strscpy().
[1][2]

However, to simplify and clarify the code, to concatenate labels use the
scnprintf() function. This way it is not necessary to check the return
value of strscpy() (-E2BIG if the parameter count is 0 or the src was
truncated) since scnprintf() always returns the number of chars written
into the buffer. This function always returns a nul-terminated string
even if it needs to be truncated.

While at it, fix all other broken string generation code that wrongly
interprets snprintf()'s return code or just uses sprintf(), implement
that using scnprintf() here too. Drop breaks in loops around
scnprintf() as it is safe now to loop.

Moreover, the check is not needed: for the case when the buffer is
exhausted, len never gets zero because scnprintf() takes the full buffer
length as input parameter, but excludes the trailing '\0' in its return
code and thus, 1 is the minimum len.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
[2] https://github.com/KSPP/linux/issues/88

 [ rric: Replace snprintf() with scnprintf(), rework sprintf() user,
   drop breaks in loops around scnprintf(), introduce 'end' pointer to
   reduce pointer arithmetic, use prefix pattern for e->location,
   adjust subject and description ]

Co-developed-by: Joe Perches <joe@perches.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210903150539.7282-1-len.baker@gmx.com
2021-09-15 13:52:58 +02:00
Naveen Krishna Chatradhi
e1ca90b7cc EDAC/mc: Add new HBM2 memory type
Add a new entry to 'enum mem_type' and a new string to 'edac_mem_types[]'
for HBM2 (High Bandwidth Memory Gen 2) new memory type.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210630152828.162659-4-nchatrad@amd.com
2021-07-20 09:20:49 -07:00
Qiuxu Zhuo
bc1c99a597 EDAC: Add DDR5 new memory type
Add a new entry to 'enum mem_type' and a new string to
'edac_mem_types[]' for DDR5 new memory type.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-11-19 12:57:09 -08:00
Qiuxu Zhuo
3b20369313 EDAC: Add three new memory types
There are {Low-Power DDR3/4, WIO2} types of memory.
Add new entries to 'enum mem_type' and new strings to
'edac_mem_types[]' for the new types.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-11-05 08:30:22 -08:00
Zhenzhong Duan
e9ff6636d3 EDAC/mc: Call edac_inc_ue_error() before panic
By calling edac_inc_ue_error() before panic, we get a correct UE error
count for core dump analysis.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200610065846.3626-2-zhenzhong.duan@gmail.com
2020-06-15 11:19:52 -07:00
Tony Luck
7fc0b9b995 EDAC: Drop the EDAC report status checks
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).

Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
2020-04-14 16:01:01 +02:00
Robert Richter
4aa92c8646 EDAC/mc: Remove per layer counters
Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
turns out that only the leaves in the memory hierarchy are consumed
(in sysfs), but not the intermediate layers, e.g.:

  count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];

These unused counters only add complexity, remove them. The error
counter values are directly stored in struct dimm_info now.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-11-rrichter@marvell.com
2020-02-17 13:37:00 +01:00
Robert Richter
1853ee7299 EDAC/mc: Remove detail[] string and cleanup error string generation
The error descriptor is passed to the error reporting functions, so
the error details can be directly generated there. Move string
generation from edac_raw_mc_handle_error() to edac_ce_error() and
edac_ue_error(). The intermediate detail[] string can be removed then.

Also, cleanup the string generation by switching to a single variant
only using the ternary operator.

 [ bp: put ternary operators on a separate line for better readability
   and use the short-form "inline if" in edac_mc_handle_error(). ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-10-rrichter@marvell.com
2020-02-17 13:36:28 +01:00
Robert Richter
6ab76179ad EDAC/mc: Pass the error descriptor to error reporting functions
Most arguments of error reporting functions are already stored in the
struct edac_raw_error_desc error descriptor. Pass the error descriptor
to the functions and reduce the functions' argument list.

 [ bp: Sort function args in reverse fir tree order. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-9-rrichter@marvell.com
2020-02-17 13:35:30 +01:00
Robert Richter
67792cf958 EDAC/mc: Remove enable_per_layer_report function argument
Many functions carry the enable_per_layer_report argument. This is a
bool value indicating the error information contains some location
data where the error occurred. This can easily being determined by
checking the pos[] array for values. Negative values indicate there is
no location available. So if the top layer is negative, the error
location is unknown.

Just check if the top layer is negative and remove
enable_per_layer_report as function argument and also from struct
edac_raw_error_desc.

 [ bp: Reflow comments to 80 columns, while at it. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-8-rrichter@marvell.com
2020-02-17 13:13:16 +01:00
Robert Richter
65bb4d1af9 EDAC/mc: Report "unknown memory" on too many DIMM labels found
There is a limitation to report only EDAC_MAX_LABELS in e->label of
the error descriptor. This is to prevent a potential string overflow.

The current implementation falls back to "any memory" in this case and
also stops all further processing to find a unique row and channel of
the possible error location.

Reporting "any memory" is wrong as the memory controller reported an
error location for one of the layers. Instead, report "unknown memory"
and also do not break early in the loop to further check row and channel
for uniqueness.

 [ bp: Massage commit message. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-7-rrichter@marvell.com
2020-02-17 13:10:14 +01:00
Robert Richter
6334dc4e3f EDAC/mc: Carve out error increment into a separate function
Carve out the error_count increment into a separate function
edac_inc_csrow(). This better separates code and reduces the indentation
level.

Implementation note: The function edac_inc_csrow() counts the same
as before, ->ce_count is only incremented if row >= 0. This is esp.
true for the case of (!e->enable_per_layer_report). Here, a DIMM was
not found, variable row still has a value of -1 and ->ce_count is not
incremented.

 [ bp: Massage commit message. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200214141757.8976-1-rrichter@marvell.com
2020-02-17 13:07:50 +01:00
Robert Richter
91b327f672 EDAC/mc: Determine mci pointer from the error descriptor
Each struct mci has its own error descriptor. Create a function
error_desc_to_mci() to determine the corresponding mci from an
error descriptor. This removes @mci from the parameter list of
edac_raw_mc_handle_error() as the mci pointer does not need to be passed
any longer.

 [ bp: Massage commit message. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-5-rrichter@marvell.com
2020-02-17 13:05:10 +01:00
Robert Richter
672ef0e568 EDAC: Store error type in struct edac_raw_error_desc
Store the error type in struct edac_raw_error_desc. This makes the
type parameter of edac_raw_mc_handle_error() obsolete.

[ kernel-doc typo ]
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-4-rrichter@marvell.com
2020-02-17 13:02:30 +01:00
Robert Richter
1f27c79062 EDAC/mc: Reorder functions edac_mc_alloc*()
Reorder the new created functions edac_mc_alloc_csrows() and
edac_mc_alloc_dimms() and move them before edac_mc_alloc(). No further
code changes.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-3-rrichter@marvell.com
2020-02-17 12:57:18 +01:00
Robert Richter
aad28c6f6b EDAC/mc: Split edac_mc_alloc() into smaller functions
edac_mc_alloc() is huge. Factor out code by moving it to the two new
functions edac_mc_alloc_csrows() and edac_mc_alloc_dimms(). Do not
move code yet for better review.

 [ bp: sort local args in reversed fir tree order. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-2-rrichter@marvell.com
2020-02-17 12:47:50 +01:00
Robert Richter
bea1bfd5b7 EDAC/mc: Change mci device removal to use put_device()
There are dimm and csrow devices linked to the mci device esp. to show
up in sysfs. It must be granted that children devices are removed before
its mci parent. Thus, the release functions must be called in the
correct order and may not miss any child before releasing its parent. In
the current implementation this is only granted by the correct order of
release functions.

A much better approach is to use put_device() that releases the device
only after all users are gone. It is the recommended way to release a
device and free its memory. The function uses the device's refcount and
only frees it if there are no users of it anymore such as children.

So implement a mci_release() function to remove mci devices, use
put_device() to free them and early initialize the mci device right
after its struct has been allocated.

Change the release function so that it can be universally used no
matter if the device is registered or not. Since subsequent dimm
and csrow sysfs links are implemented as children devices, their
refcounts will keep the parent mci device from being removed as long
as sysfs entries exist and until all users have been unregistered in
edac_remove_sysfs_mci_device().

Remove edac_unregister_sysfs() and merge mci sysfs removal into
edac_remove_sysfs_mci_device(). There is only a single instance now that
removes the sysfs entries. The function can now be used in the error
paths for cleanup.

Also, create device release functions for all involved devices
(dev->release), remove device_type release functions (dev_type->
release) and also use dev->init_name instead of dev_set_name().

 [ bp: Massage commit message and comments. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200212120340.4764-5-rrichter@marvell.com
2020-02-17 12:32:44 +01:00
Robert Richter
216aa145aa EDAC/mc: Fix use-after-free and memleaks during device removal
A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and
DEBUG_KMEMLEAK set, revealed several issues when removing an mci device:

1) Use-after-free:

On 27.11.19 17:07:33, John Garry wrote:
> [   22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180

The use-after-free is caused by the mci_for_each_dimm() macro called in
edac_remove_sysfs_mci_device(). The iterator was introduced with

  c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator").

The iterator loop calls device_unregister(&dimm->dev), which removes
the sysfs entry of the device, but also frees the dimm struct in
dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(),
the dimm struct is accessed again, after having been freed already.

The fix is to free all the mci device's subsequent dimm and csrow
objects at a later point, in _edac_mc_free(), when the mci device itself
is being freed.

This keeps the data structures intact and the mci device can be
fully used until its removal. The change allows the safe usage of
mci_for_each_dimm() to release dimm devices from sysfs.

2) Memory leaks:

Following memory leaks have been detected:

 # grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
       1     [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0      # mci->csrows
      16     [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0      # csr->channels
      16     [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0      # csr->channels[chn]
       1     [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0      # mci->dimms
      34     [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()

All leaks are from memory allocated by edac_mc_alloc().

Note: The test above shows that edac_mc_alloc() was called here from
ghes_edac_register(), thus both functions show up in the stack trace
but the module causing the leaks is edac_mc. The comments with the data
structures involved were made manually by analyzing the objdump.

The data structures listed above and created by edac_mc_alloc() are
not properly removed during device removal, which is done in
edac_mc_free().

There are two paths implemented to remove the device depending on device
registration, _edac_mc_free() is called if the device is not registered
and edac_unregister_sysfs() otherwise.

The implemenations differ. For the sysfs case, the mci device removal
lacks the removal of subsequent data structures (csrows, channels,
dimms). This causes the memory leaks (see mci_attr_release()).

 [ bp: Massage commit message. ]

Fixes: c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator")
Fixes: faa2ad09c0 ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.")
Fixes: 7a623c0390 ("edac: rewrite the sysfs code to use struct device")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200212120340.4764-3-rrichter@marvell.com
2020-02-13 13:28:52 +01:00
Robert Richter
787d899914 EDAC: Unify the mc_event tracepoint call
The code in ghes_edac.c and edac_mc.c for grain_bits calculation and
calling trace_mc_event() is now the same. Move it to a single location
in edac_raw_mc_handle_error().

The only difference is the missing IS_ENABLED(CONFIG_RAS) switch, but
this is needed for ghes too.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-13-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
0d8292e003 EDAC/mc: Reduce indentation level in edac_mc_handle_error()
Reduce the indentation level in edac_mc_handle_error() a bit.

No functional changes.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-7-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
47bec6b4c3 EDAC/mc: Remove needless zero string termination
The e string to which this is pointing to has already been cleared
earlier in the function so remove the needless zero string termination.

 [ bp: Correct the commit message. ]

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-6-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
d260e8ff51 EDAC/mc: Do not BUG_ON() in edac_mc_alloc()
No need to crash the system in case edac_mc_alloc() is called with
invalid arguments, just warn and return. This would cause a checkpatch
warning when touching the code later, so just fix it.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-5-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
c498afaf7d EDAC: Introduce an mci_for_each_dimm() iterator
Introduce an mci_for_each_dimm() iterator. It returns a pointer to
a struct dimm_info. This makes the declaration and use of an index
obsolete and avoids access to internal data of struct mci (direct array
access etc).

 [ bp: push the struct dimm_info *dimm; declaration into the
   CONFIG_EDAC_DEBUG block. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-4-rrichter@marvell.com
2019-11-10 12:39:40 +01:00
Robert Richter
977b1ce7c1 EDAC: Remove EDAC_DIMM_OFF() macro
The EDAC_DIMM_OFF() macro takes 5 arguments to get the DIMM's index.
Simplify this by storing the index in struct dimm_info to avoid its
calculation and remove the EDAC_DIMM_OFF() macro. The index can be
directly used then.

Another advantage is that edac_mc_alloc() could be used even if the
exact size of the layers is unknown. Only the number of DIMMs would be
needed.

Rename iterator variable to idx, while at it. The name is more handy,
esp. when searching for it in the code.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-3-rrichter@marvell.com
2019-11-09 11:23:49 +01:00
Robert Richter
d55c79ac86 EDAC: Prefer 'unsigned int' to bare use of 'unsigned'
Use of 'unsigned int' instead of bare use of 'unsigned'. Fix this for
edac_mc*, ghes and the i5100 driver as reported by checkpatch.pl.

While at it, struct member dev_ch_attribute->channel is always used as
unsigned int. Change type to unsigned int to avoid type casts.

 [ bp: Massage. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190902123216.9809-2-rrichter@marvell.com
2019-09-03 19:21:19 +02:00
Robert Richter
718d58514e EDAC/mc: Cleanup _edac_mc_free() code
Remove needless and boilerplate variable declarations. No functional
changes.

 [ bp: Add newlines for better readability. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190624150758.6695-10-rrichter@marvell.com
2019-08-14 18:27:00 +02:00
Robert Richter
3724ace582 EDAC/mc: Fix grain_bits calculation
The grain in EDAC is defined as "minimum granularity for an error
report, in bytes". The following calculation of the grain_bits in
edac_mc is wrong:

	grain_bits = fls_long(e->grain) + 1;

Where grain_bits is defined as:

	grain = 1 << grain_bits

Example:

	grain = 8	# 64 bit (8 bytes)
	grain_bits = fls_long(8) + 1
	grain_bits = 4 + 1 = 5

	grain = 1 << grain_bits
	grain = 1 << 5 = 32

Replace it with the correct calculation:

	grain_bits = fls_long(e->grain - 1);

The example gives now:

	grain_bits = fls_long(8 - 1)
	grain_bits = fls_long(7)
	grain_bits = 3

	grain = 1 << 3 = 8

Also, check if the hardware reports a reasonable grain != 0 and fallback
with a warning to 1 byte granularity otherwise.

 [ bp: massage a bit. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190624150758.6695-2-rrichter@marvell.com
2019-08-03 12:05:51 +02:00
Robert Richter
29a0c84397 EDAC/mc: Fix edac_mc_find() in case no device is found
The function should return NULL in case no device is found, but it
always returns the last checked mc device from the list even if the
index did not match. Fix that.

I did some analysis why this did not raise any issues for about 3 years
and the reason is that edac_mc_find() is mostly used to search for
existing devices. Thus, the bug is not triggered.

 [ bp: Drop the if (mci->mc_idx > idx) test in favor of readability. ]

Fixes: c73e8833be ("EDAC, mc: Fix locking around mc_devices list")
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lkml.kernel.org/r/20190514104838.15065-1-rrichter@marvell.com
2019-05-14 17:08:46 +02:00
Borislav Petkov
861e6ed667 EDAC: Drop per-memory controller buses
... and use the single edac_subsys object returned from
subsys_system_register(). The idea is to have a single bus
and multiple devices on it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
CC: Aristeu Rozanski Filho <arozansk@redhat.com>
CC: Greg KH <gregkh@linuxfoundation.org>
CC: Justin Ernst <justin.ernst@hpe.com>
CC: linux-edac <linux-edac@vger.kernel.org>
CC: Mauro Carvalho Chehab <mchehab@kernel.org>
CC: Russ Anderson <rja@hpe.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20180926152752.GG5584@zn.tnic
2018-11-13 21:55:24 +01:00
Takashi Iwai
b748f2de4b EDAC: Add missing MEM_LRDDR4 entry in edac_mem_types[]
The edac_mem_types[] array misses a MEM_LRDDR4 entry, which leads to
NULL pointer dereference when accessed via sysfs or such.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180810141426.8918-1-tiwai@suse.de
Fixes: 1e8096bb20 ("EDAC: Add LRDDR4 DRAM type")
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-08-17 15:13:34 +02:00
Tony Luck
001f86137d EDAC: Add new memory type for non-volatile DIMMs
There are now non-volatile versions of DIMMs. Add a new entry to "enum
mem_type" and a new string in edac_mem_types[].

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-acpi@vger.kernel.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/20180312182430.10335-3-tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-03-14 12:32:06 +01:00
Tony Luck
d6dd77ebcd EDAC: Drop duplicated array of strings for memory type names
Somehow we ended up with two separate arrays of strings to describe the
"enum mem_type" values.

In edac_mc.c we have an exported list edac_mem_types[] that is used
by a couple of drivers in debug messaged.

In edac_mc_sysfs.c we have a private list that is used to display
values in:
  /sys/devices/system/edac/mc/mc*/dimm*/dimm_mem_type
  /sys/devices/system/edac/mc/mc*/csrow*/mem_type

This list was missing a value for MEM_LRDDR3.

The string values in the two lists were different :-(

Combining the lists, I kept the values so that the sysfs output
will be unchanged as some scripts may depend on that.

Reported-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-acpi@vger.kernel.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/20180312182430.10335-2-tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-03-14 12:20:16 +01:00
Toshi Kani
3877c7d1e2 EDAC: Add helper which returns the loaded platform driver
Only a single EDAC platform driver can be loaded. When ghes_edac is
enabled, an EDAC platform driver still attempts to register itself and
fails in edac_mc_add_mc().

Add edac_get_owner() so that EDAC platform drivers can check the owner
first.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170823225447.15608-5-toshi.kani@hpe.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-25 12:55:59 +02:00
Borislav Petkov
bffc7dece9 EDAC: Rename report status accessors
Change them to have the edac_ prefix.

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:15:02 +02:00
Borislav Petkov
fee27d7d97 EDAC: Delete edac_stub.c
Move the remaining functionality to edac_mc.c. Convert "edac_report=" to
a module parameter.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:14:48 +02:00
Borislav Petkov
be1d162948 EDAC: Issue tracepoint only when it is defined
... and this happens only when CONFIG_RAS is enabled.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:14:38 +02:00
Borislav Petkov
8c22b4fece EDAC: Move edac_op_state to edac_mc.c
... as part of moving stuff away from edac_stub.c

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:14:29 +02:00
Borislav Petkov
d3116a0837 EDAC: Remove edac_err_assert
... and the glue around it. It is not needed anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:14:21 +02:00
Borislav Petkov
97bb6c17ad EDAC: Get rid of edac_handlers
Use mc_devices list instead to check whether we have EDAC driver
instances successfully registered with EDAC core.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:14:17 +02:00
Yazen Ghannam
d7fc9d77ac EDAC: Add routine to check if MC devices list is empty
We need to know if any MC devices have been allocated.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1485537863-2707-7-git-send-email-Yazen.Ghannam@amd.com
[ Prettify text. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-01-28 14:36:47 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00