korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-09-22 12:44:11 +08:00

Jiaqi Yan 44b8f8bf24 mm: memory-failure: add memory failure stats to sysfs	2023-02-02 22:33:28 -08:00

Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage of software-based scanning over the hardware patrol scrubber is the ability to make statistics visible to system administrators. The statistics fall into 2 categories:

* Memory error statistics, for example, how many memory errors are encountered and how many of them are recovered by the kernel. Note that these memory errors are non-fatal to the kernel: during machine check exception (MCE) handling, the kernel has already classified the MCE's severity as not requiring a panic (but as either action required or action optional).
* Scanner statistics, for example, how many times the scanner has fully scanned a NUMA node and how many errors are first detected by the scanner.

The memory error statistics are useful to userspace and are not specific to scanner-detected memory errors. They are the focus of this patchset.

Motivation
==========

Memory error stats are important to userspace but insufficient in the kernel today. Datacenter administrators can better monitor a machine's memory health with the visible stats. For example, while memory errors are inevitable on servers with 10+ TB of memory, starting server maintenance when there are only 1~2 recovered memory errors could be overreacting; in a cloud production environment, maintenance usually means live migrating all the workloads running on the server, and this usually causes nontrivial disruption to the customer. Providing insight into the scope of memory errors on a system helps to determine the appropriate follow-up action. In addition, the kernel's existing memory error stats need to be standardized so that userspace can reliably count on their usefulness.

Today the kernel provides the following memory error info to userspace, but they are not sufficient or have disadvantages:

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides much useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory error detector. Today the data source of memory error stats is either direct memory error consumption or hardware patrol scrubber detection (signaled as either UCNA or SRAO). Once the in-kernel memory scanner is implemented, it will be the main source, as it is usually configured to scan memory DIMMs constantly and faster than the hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is useful independent of a software or hardware scanner. Therefore we implement the memory error statistics independent of the in-kernel memory error detector. It exposes the following per NUMA node memory error counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. This approach can be easier to extend for future use cases than /proc/meminfo, trace events, and logs. The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.

The 1st commit introduces these sysfs entries. The 2nd commit populates memory error stats every time memory_failure attempts memory error recovery. The 3rd commit adds documentation for the introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

This patch (of 3):

Today the kernel provides the following memory error info to userspace, but each has its own disadvantage:

* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides much useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Expose per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
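
The per-node counters and the identity total = recovered + ignored + failed + delayed from the commit message can be checked from userspace. Below is a minimal Python sketch, not part of the patch. It assumes each counter file holds a single decimal integer (the commit message does not state the file format), and the names COUNTERS, read_node_stats, is_consistent, and collect_stats are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path

# Counter file names listed in the commit message.
COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def read_node_stats(node_dir):
    """Read the five memory_failure counters below one node directory.

    Assumption: each file contains one decimal integer.
    """
    stats_dir = Path(node_dir) / "memory_failure"
    return {name: int((stats_dir / name).read_text().strip())
            for name in COUNTERS}


def is_consistent(stats):
    """Check total == recovered + ignored + failed + delayed."""
    resolved = sum(stats[name] for name in COUNTERS if name != "total")
    return stats["total"] == resolved


def collect_stats(sysfs_root="/sys/devices/system/node"):
    """Return {node_name: stats} for every node exposing memory_failure."""
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result
    for node_dir in sorted(root.glob("node[0-9]*")):
        # Kernels without this patch have no memory_failure directory.
        if (node_dir / "memory_failure").is_dir():
            result[node_dir.name] = read_node_stats(node_dir)
    return result
```

Calling collect_stats() with no argument reads the real sysfs tree, and passing another root makes the helpers testable against a scratch directory. Because the five files are read one after another, a memory error handled between two reads could make is_consistent() report a transient mismatch, so a monitoring script would likely re-read before raising an alert.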
..
acpi	ACPI: Fix selecting wrong ACPI fwnode for the iGPU on some Dell laptops	2023-01-10 20:23:48 +01:00
asm-generic	mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()	2023-01-18 17:12:37 -08:00
clocksource	Updates for timers, timekeeping and drivers:	2022-12-12 12:52:02 -08:00
crypto	crypto: acomp - define max size for destination	2022-12-09 18:45:00 +08:00
drm	Merge drm/drm-fixes into drm-misc-fixes	2023-01-03 08:32:12 +01:00
dt-bindings	remoteproc updates for v6.2	2022-12-21 09:37:14 -08:00
keys
kunit	kunit: add macro to allow conditionally exposing static symbols to tests	2022-12-12 14:13:48 -07:00
kvm	Merge branch kvm-arm64/pmu-unchained into kvmarm-master/next	2022-12-05 14:38:44 +00:00
linux	mm: memory-failure: add memory failure stats to sysfs	2023-02-02 22:33:28 -08:00
math-emu
media	Merge tag 'br-v6.2i' of git://linuxtv.org/hverkuil/media_tree into media_stage	2022-12-07 17:58:47 +01:00
memory
misc
net	rxrpc: Tidy up abort generation infrastructure	2023-01-06 09:43:32 +00:00
pcmcia
ras
rdma	RDMA: Extend RDMA kernel verbs ABI to support flush	2022-12-09 19:36:01 -04:00
rv
scsi	Merge branch '6.2/scsi-queue' into 6.2/scsi-fixes	2022-12-30 16:29:34 +00:00
soc	Networking changes for 6.2.	2022-12-13 15:47:48 -08:00
sound	ALSA: hda/hdmi: fix stream-id config keep-alive for rt suspend	2022-12-09 12:06:15 +01:00
target
trace	mm: discard __GFP_ATOMIC	2023-02-02 22:33:13 -08:00
uapi	mm: implement memory-deny-write-execute as a prctl	2023-02-02 22:33:24 -08:00
ufs
vdso
video	fbdev: omapfb: connector-analog-tv: remove support for platform data	2022-12-14 20:01:49 +01:00
xen	xen: make remove callback of xen driver void returned	2022-12-15 16:06:10 +01:00