linux/lib/Kconfig.debug

3054 lines
100 KiB
Plaintext
Raw Normal View History

# SPDX-License-Identifier: GPL-2.0-only
menu "Kernel hacking"
menu "printk and dmesg options"
config PRINTK_TIME
bool "Show timing information on printks"
depends on PRINTK
help
Selecting this option causes time stamps of the printk()
messages to be added to the output of the syslog() system
call and at the console.
The timestamp is always recorded internally, and exported
to /dev/kmsg. This flag just specifies if the timestamp should
be included, not that the timestamp is recorded.
The behavior is also controlled by the kernel command line
parameter printk.time=1. See Documentation/admin-guide/kernel-parameters.rst
printk: Add caller information to printk() output. Sometimes we want to print a series of printk() messages to consoles without being disturbed by concurrent printk() from interrupts and/or other threads. But we can't enforce printk() callers to use their local buffers because we need to ask them to make too much changes. Also, even buffering up to one line inside printk() might cause failing to emit an important clue under critical situation. Therefore, instead of trying to help buffering, let's try to help reconstructing messages by saving caller information as of calling log_store() and adding it as "[T$thread_id]" or "[C$processor_id]" upon printing to consoles. Some examples for console output: [ 1.222773][ T1] x86: Booting SMP configuration: [ 2.779635][ T1] pci 0000:00:01.0: PCI bridge to [bus 01] [ 5.069193][ T268] Fusion MPT base driver 3.04.20 [ 9.316504][ C2] random: fast init done [ 13.413336][ T3355] Initialized host personality Some examples for /dev/kmsg output: 6,496,1222773,-,caller=T1;x86: Booting SMP configuration: 6,968,2779635,-,caller=T1;pci 0000:00:01.0: PCI bridge to [bus 01] SUBSYSTEM=pci DEVICE=+pci:0000:00:01.0 6,1353,5069193,-,caller=T268;Fusion MPT base driver 3.04.20 5,1526,9316504,-,caller=C2;random: fast init done 6,1575,13413336,-,caller=T3355;Initialized host personality Note that this patch changes max length of messages which can be printed by printk() or written to /dev/kmsg interface from 992 bytes to 976 bytes, based on an assumption that userspace won't try to write messages hitting that border line to /dev/kmsg interface. Link: http://lkml.kernel.org/r/93f19e57-5051-c67d-9af4-b17624062d44@i-love.sakura.ne.jp Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: LKML <linux-kernel@vger.kernel.org> Cc: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-12-18 05:05:04 +08:00
config PRINTK_CALLER
bool "Show caller information on printks"
depends on PRINTK
help
Selecting this option causes printk() to add a caller "thread id" (if
in task context) or a caller "processor id" (if not in task context)
to every message.
This option is intended for environments where multiple threads
concurrently call printk() for many times, for it is difficult to
interpret without knowing where these lines (or sometimes individual
line which was divided into multiple lines due to race) came from.
Since toggling after boot makes the code racy, currently there is
no option to enable/disable at the kernel command line parameter or
sysfs interface.
dump_stack: add vmlinux build ID to stack traces Add the running kernel's build ID[1] to the stacktrace information header. This makes it simpler for developers to locate the vmlinux with full debuginfo for a particular kernel stacktrace. Combined with scripts/decode_stracktrace.sh, a developer can download the correct vmlinux from a debuginfod[2] server and find the exact file and line number for the functions plus offsets in a stacktrace. This is especially useful for pstore crash debugging where the kernel crashes are recorded in the pstore logs and the recovery kernel is different or the debuginfo doesn't exist on the device due to space concerns (the data can be large and a security concern). The stacktrace can be analyzed after the crash by using the build ID to find the matching vmlinux and understand where in the function something went wrong. Example stacktrace from lkdtm: WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm] Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1 Hardware name: Google Lazor (rev3+) with KB Backlight (DT) pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--) pc : lkdtm_WARNING+0x28/0x30 [lkdtm] The hex string aa23f7a1231c229de205662d5a9e0d4c580f19a1 is the build ID, following the kernel version number. Put it all behind a config option, STACKTRACE_BUILD_ID, so that kernel developers can remove this information if they decide it is too much. Link: https://lkml.kernel.org/r/20210511003845.2429846-5-swboyd@chromium.org Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1] Link: https://sourceware.org/elfutils/Debuginfod.html [2] Signed-off-by: Stephen Boyd <swboyd@chromium.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Evan Green <evgreen@chromium.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Young <dyoung@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sasha Levin <sashal@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 09:09:17 +08:00
config STACKTRACE_BUILD_ID
bool "Show build ID information in stacktraces"
depends on PRINTK
help
Selecting this option adds build ID information for symbols in
stacktraces printed with the printk format '%p[SR]b'.
This option is intended for distros where debuginfo is not easily
accessible but can be downloaded given the build ID of the vmlinux or
kernel module where the function is located.
config CONSOLE_LOGLEVEL_DEFAULT
int "Default console loglevel (1-15)"
range 1 15
default "7"
help
Default loglevel to determine what will be printed on the console.
Setting a default here is equivalent to passing in loglevel=<x> in
the kernel bootargs. loglevel=<x> continues to override whatever
value is specified here as well.
Note: This does not affect the log level of un-prefixed printk()
usage in the kernel. That is controlled by the MESSAGE_LOGLEVEL_DEFAULT
option.
config CONSOLE_LOGLEVEL_QUIET
int "quiet console loglevel (1-15)"
range 1 15
default "4"
help
loglevel to use when "quiet" is passed on the kernel commandline.
When "quiet" is passed on the kernel commandline this loglevel
will be used as the loglevel. IOW passing "quiet" will be the
equivalent of passing "loglevel=<CONSOLE_LOGLEVEL_QUIET>"
config MESSAGE_LOGLEVEL_DEFAULT
int "Default message log level (1-7)"
range 1 7
default "4"
help
Default log level for printk statements with no specified priority.
This was hard-coded to KERN_WARNING since at least 2.6.10 but folks
that are auditing their logs closely may want to set it to a lower
priority.
Note: This does not affect what message level gets printed on the console
by default. To change that, use loglevel=<x> in the kernel bootargs,
or pick a different CONSOLE_LOGLEVEL_DEFAULT configuration value.
config BOOT_PRINTK_DELAY
bool "Delay each boot printk message by N milliseconds"
depends on DEBUG_KERNEL && PRINTK && GENERIC_CALIBRATE_DELAY
help
This build option allows you to read kernel boot messages
by inserting a short delay after each one. The delay is
specified in milliseconds on the kernel command line,
using "boot_delay=N".
It is likely that you would also need to use "lpj=M" to preset
the "loops per jiffie" value.
See a previous boot log for the "lpj" value to use for your
system, and then set "lpj=M" before setting "boot_delay=N".
NOTE: Using this option may adversely affect SMP systems.
I.e., processors other than the first one may not boot up.
BOOT_PRINTK_DELAY also may cause LOCKUP_DETECTOR to detect
what it believes to be lockup conditions.
config DYNAMIC_DEBUG
bool "Enable dynamic printk() support"
default n
depends on PRINTK
depends on (DEBUG_FS || PROC_FS)
select DYNAMIC_DEBUG_CORE
help
Compiles debug level messages into the kernel, which would not
otherwise be available at runtime. These messages can then be
enabled/disabled based on various levels of scope - per source file,
function, module, format string, and line number. This mechanism
implicitly compiles in all pr_debug() and dev_dbg() calls, which
enlarges the kernel text size by about 2%.
If a source file is compiled with DEBUG flag set, any
pr_debug() calls in it are enabled by default, but can be
disabled at runtime as below. Note that DEBUG flag is
turned on by many CONFIG_*DEBUG* options.
Usage:
Dynamic debugging is controlled via the 'dynamic_debug/control' file,
which is contained in the 'debugfs' filesystem or procfs.
Thus, the debugfs or procfs filesystem must first be mounted before
making use of this feature.
We refer the control file as: <debugfs>/dynamic_debug/control. This
file contains a list of the debug statements that can be enabled. The
format for each line of the file is:
filename:lineno [module]function flags format
filename : source file of the debug statement
lineno : line number of the debug statement
module : module that contains the debug statement
function : function that contains the debug statement
flags : '=p' means the line is turned 'on' for printing
format : the format used for the debug statement
From a live system:
nullarbor:~ # cat <debugfs>/dynamic_debug/control
# filename:lineno [module]function flags format
fs/aio.c:222 [aio]__put_ioctx =_ "__put_ioctx:\040freeing\040%p\012"
fs/aio.c:248 [aio]ioctx_alloc =_ "ENOMEM:\040nr_events\040too\040high\012"
fs/aio.c:1770 [aio]sys_io_cancel =_ "calling\040cancel\012"
Example usage:
// enable the message at line 1603 of file svcsock.c
nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
<debugfs>/dynamic_debug/control
// enable all the messages in file svcsock.c
nullarbor:~ # echo -n 'file svcsock.c +p' >
<debugfs>/dynamic_debug/control
// enable all the messages in the NFS server module
nullarbor:~ # echo -n 'module nfsd +p' >
<debugfs>/dynamic_debug/control
// enable all 12 messages in the function svc_process()
nullarbor:~ # echo -n 'func svc_process +p' >
<debugfs>/dynamic_debug/control
// disable all 12 messages in the function svc_process()
nullarbor:~ # echo -n 'func svc_process -p' >
<debugfs>/dynamic_debug/control
See Documentation/admin-guide/dynamic-debug-howto.rst for additional
information.
config DYNAMIC_DEBUG_CORE
bool "Enable core function of dynamic debug support"
depends on PRINTK
depends on (DEBUG_FS || PROC_FS)
help
Enable core functional support of dynamic debug. It is useful
when you want to tie dynamic debug to your kernel modules with
DYNAMIC_DEBUG_MODULE defined for each of them, especially for
the case of embedded system where the kernel image size is
sensitive for people.
printf: add support for printing symbolic error names It has been suggested several times to extend vsnprintf() to be able to convert the numeric value of ENOSPC to print "ENOSPC". This implements that as a %p extension: With %pe, one can do if (IS_ERR(foo)) { pr_err("Sorry, can't do that: %pe\n", foo); return PTR_ERR(foo); } instead of what is seen in quite a few places in the kernel: if (IS_ERR(foo)) { pr_err("Sorry, can't do that: %ld\n", PTR_ERR(foo)); return PTR_ERR(foo); } If the value passed to %pe is an ERR_PTR, but the library function errname() added here doesn't know about the value, the value is simply printed in decimal. If the value passed to %pe is not an ERR_PTR, we treat it as an ordinary %p and thus print the hashed value (passing non-ERR_PTR values to %pe indicates a bug in the caller, but we can't do much about that). With my embedded hat on, and because it's not very invasive to do, I've made it possible to remove this. The errname() function and associated lookup tables take up about 3K. For most, that's probably quite acceptable and a price worth paying for more readable dmesg (once this starts getting used), while for those that disable printk() it's of very little use - I don't see a procfs/sysfs/seq_printf() file reasonably making use of this - and they clearly want to squeeze vmlinux as much as possible. Hence the default y if PRINTK. The symbols to include have been found by massaging the output of find arch include -iname 'errno*.h' | xargs grep -E 'define\s*E' In the cases where some common aliasing exists (e.g. EAGAIN=EWOULDBLOCK on all platforms, EDEADLOCK=EDEADLK on most), I've moved the more popular one (in terms of 'git grep -w Efoo | wc) to the bottom so that one takes precedence. Link: http://lkml.kernel.org/r/20191015190706.15989-1-linux@rasmusvillemoes.dk To: "Jonathan Corbet" <corbet@lwn.net> To: linux-kernel@vger.kernel.org Cc: "Andy Shevchenko" <andy.shevchenko@gmail.com> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: "Joe Perches" <joe@perches.com> Cc: linux-doc@vger.kernel.org Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Reviewed-by: Petr Mladek <pmladek@suse.com> [andy.shevchenko@gmail.com: use abs()] Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-10-16 03:07:05 +08:00
config SYMBOLIC_ERRNAME
bool "Support symbolic error names in printf"
default y if PRINTK
help
If you say Y here, the kernel's printf implementation will
be able to print symbolic error names such as ENOSPC instead
of the number 28. It makes the kernel image slightly larger
(about 3KB), but can make the kernel logs easier to read.
config DEBUG_BUGVERBOSE
bool "Verbose BUG() reporting (adds 70K)" if DEBUG_KERNEL && EXPERT
depends on BUG && (GENERIC_BUG || HAVE_DEBUG_BUGVERBOSE)
default y
help
Say Y here to make BUG() panics output the file name and line number
of the BUG call as well as the EIP and oops trace. This aids
debugging but costs about 70-100K of memory.
endmenu # "printk and dmesg options"
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
config DEBUG_KERNEL
bool "Kernel debugging"
help
Say Y here if you are developing drivers or trying to debug and
identify kernel problems.
config DEBUG_MISC
bool "Miscellaneous debug code"
default DEBUG_KERNEL
depends on DEBUG_KERNEL
help
Say Y here if you need to enable miscellaneous debug code that should
be under a more specific debug option but isn't.
menu "Compile-time checks and compiler options"
config DEBUG_INFO
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
bool
help
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
A kernel debug info option other than "None" has been selected
in the "Debug information" choice below, indicating that debug
information will be generated for build targets.
# Clang generates .uleb128 with label differences for DWARF v5, a feature that
# older binutils ports do not support when utilizing RISC-V style linker
# relaxation: https://sourceware.org/bugzilla/show_bug.cgi?id=27215
config AS_HAS_NON_CONST_ULEB128
lib/Kconfig.debug: Add check for non-constant .{s,u}leb128 support to DWARF5 When building with a RISC-V kernel with DWARF5 debug info using clang and the GNU assembler, several instances of the following error appear: /tmp/vgettimeofday-48aa35.s:2963: Error: non-constant .uleb128 is not supported Dumping the .s file reveals these .uleb128 directives come from .debug_loc and .debug_ranges: .Ldebug_loc0: .byte 4 # DW_LLE_offset_pair .uleb128 .Lfunc_begin0-.Lfunc_begin0 # starting offset .uleb128 .Ltmp1-.Lfunc_begin0 # ending offset .byte 1 # Loc expr size .byte 90 # DW_OP_reg10 .byte 0 # DW_LLE_end_of_list .Ldebug_ranges0: .byte 4 # DW_RLE_offset_pair .uleb128 .Ltmp6-.Lfunc_begin0 # starting offset .uleb128 .Ltmp27-.Lfunc_begin0 # ending offset .byte 4 # DW_RLE_offset_pair .uleb128 .Ltmp28-.Lfunc_begin0 # starting offset .uleb128 .Ltmp30-.Lfunc_begin0 # ending offset .byte 0 # DW_RLE_end_of_list There is an outstanding binutils issue to support a non-constant operand to .sleb128 and .uleb128 in GAS for RISC-V but there does not appear to be any movement on it, due to concerns over how it would work with linker relaxation. To avoid these build errors, prevent DWARF5 from being selected when using clang and an assembler that does not have support for these symbol deltas, which can be easily checked in Kconfig with as-instr plus the small test program from the dwz test suite from the binutils issue. Link: https://sourceware.org/bugzilla/show_bug.cgi?id=27215 Link: https://github.com/ClangBuiltLinux/linux/issues/1719 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-10-15 04:42:11 +08:00
def_bool $(as-instr,.uleb128 .Lexpr_end4 - .Lexpr_start3\n.Lexpr_start3:\n.Lexpr_end4:)
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
choice
prompt "Debug information"
depends on DEBUG_KERNEL
help
Selecting something other than "None" results in a kernel image
that will include debugging info resulting in a larger kernel image.
This adds debug symbols to the kernel and modules (gcc -g), and
is needed if you intend to use kernel crashdump or binary object
tools like crash, kgdb, LKCD, gdb, etc on the kernel.
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
Choose which version of DWARF debug info to emit. If unsure,
select "Toolchain default".
config DEBUG_INFO_NONE
bool "Disable debug information"
help
Do not build the kernel with debugging information, which will
result in a faster and smaller build.
config DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
bool "Rely on the toolchain's implicit default DWARF version"
select DEBUG_INFO
depends on !CC_IS_CLANG || AS_IS_LLVM || CLANG_VERSION < 140000 || (AS_IS_GNU && AS_VERSION >= 23502 && AS_HAS_NON_CONST_ULEB128)
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
help
The implicit default version of DWARF debug info produced by a
toolchain changes over time.
This can break consumers of the debug info that haven't upgraded to
support newer revisions, and prevent testing newer versions, but
those should be less common scenarios.
config DEBUG_INFO_DWARF4
bool "Generate DWARF Version 4 debuginfo"
select DEBUG_INFO
depends on !CC_IS_CLANG || AS_IS_LLVM || (AS_IS_GNU && AS_VERSION >= 23502)
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
help
Generate DWARF v4 debug info. This requires gcc 4.5+, binutils 2.35.2
if using clang without clang's integrated assembler, and gdb 7.0+.
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
If you have consumers of DWARF debug info that are not ready for
newer revisions of DWARF, you may wish to choose this or have your
config select this.
config DEBUG_INFO_DWARF5
bool "Generate DWARF Version 5 debuginfo"
select DEBUG_INFO
depends on !ARCH_HAS_BROKEN_DWARF5
depends on !CC_IS_CLANG || AS_IS_LLVM || (AS_IS_GNU && AS_VERSION >= 23502 && AS_HAS_NON_CONST_ULEB128)
Kconfig.debug: make DEBUG_INFO selectable from a choice Currently it's not possible to enable DEBUG_INFO for an all*config build, since it is marked as "depends on !COMPILE_TEST". This generally makes sense because a debug build of an all*config target ends up taking much longer and the output is much larger. Having this be "default off" makes sense. However, there are cases where enabling DEBUG_INFO for such builds is useful for doing treewide A/B comparisons of build options, etc. Make DEBUG_INFO selectable from any of the DWARF version choice options, with DEBUG_INFO_NONE being the default for COMPILE_TEST. The mutually exclusive relationship between DWARF5 and BTF must be inverted, but the result remains the same. Additionally moves DEBUG_KERNEL and DEBUG_MISC up to the top of the menu because they were enabling features _above_ it, making it weird to navigate menuconfig. [keescook@chromium.org: make DEBUG_INFO always default=n] Link: https://lkml.kernel.org/r/20220128214131.580131-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/YfRY6+CaQxX7O8vF@dev-arch.archlinux-ax161 Link: https://lkml.kernel.org/r/20220125075126.891825-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 07:05:38 +08:00
help
Generate DWARF v5 debug info. Requires binutils 2.35.2, gcc 5.0+ (gcc
5.0+ accepts the -gdwarf-5 flag but only had partial support for some
draft features until 7.0), and gdb 8.0+.
Changes to the structure of debug info in Version 5 allow for around
15-18% savings in resulting image and debug info section sizes as
compared to DWARF Version 4. DWARF Version 5 standardizes previous
extensions such as accelerators for symbol indexing and the format
for fission (.dwo/.dwp) files. Users may not want to select this
config if they rely on tooling that has not yet been updated to
support DWARF Version 5.
endchoice # "Debug information"
if DEBUG_INFO
config DEBUG_INFO_REDUCED
bool "Reduce debugging information"
help
If you say Y here gcc is instructed to generate less debugging
information for structure types. This means that tools that
need full debugging information (like kgdb or systemtap) won't
be happy. But if you merely need debugging information to
resolve line numbers there is no loss. Advantage is that
build directory object sizes shrink dramatically over a full
DEBUG_INFO build and compile times are reduced too.
Only works with newer gcc versions.
Makefile.debug: support for -gz=zstd Make DEBUG_INFO_COMPRESSED a choice; DEBUG_INFO_COMPRESSED_NONE is the default, DEBUG_INFO_COMPRESSED_ZLIB uses zlib, DEBUG_INFO_COMPRESSED_ZSTD uses zstd. This renames the existing KConfig option DEBUG_INFO_COMPRESSED to DEBUG_INFO_COMPRESSED_ZLIB so users upgrading may need to reset the new Kconfigs. Some quick N=1 measurements with du, /usr/bin/time -v, and bloaty: clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_NONE=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.43 488M vmlinux 27.6% 136Mi 0.0% 0 .debug_info 6.1% 30.2Mi 0.0% 0 .debug_str_offsets 3.5% 17.2Mi 0.0% 0 .debug_line 3.3% 16.3Mi 0.0% 0 .debug_loclists 0.9% 4.62Mi 0.0% 0 .debug_str clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_ZLIB=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.35 385M vmlinux 21.8% 85.4Mi 0.0% 0 .debug_info 2.1% 8.26Mi 0.0% 0 .debug_str_offsets 2.1% 8.24Mi 0.0% 0 .debug_loclists 1.9% 7.48Mi 0.0% 0 .debug_line 0.5% 1.94Mi 0.0% 0 .debug_str clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_ZSTD=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.69 373M vmlinux 21.4% 81.4Mi 0.0% 0 .debug_info 2.3% 8.85Mi 0.0% 0 .debug_loclists 1.5% 5.71Mi 0.0% 0 .debug_line 0.5% 1.95Mi 0.0% 0 .debug_str_offsets 0.4% 1.62Mi 0.0% 0 .debug_str That's only a 3.11% overall binary size savings over zlib, but at no performance regression. Link: https://maskray.me/blog/2022-09-09-zstd-compressed-debug-sections Link: https://maskray.me/blog/2022-01-23-compressed-debug-sections Suggested-by: Sedat Dilek (DHL Supply Chain) <sedat.dilek@dhl.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-11-11 03:59:05 +08:00
choice
prompt "Compressed Debug information"
help
Compress the resulting debug info. Results in smaller debug info sections,
but requires that consumers are able to decompress the results.
If unsure, choose DEBUG_INFO_COMPRESSED_NONE.
config DEBUG_INFO_COMPRESSED_NONE
bool "Don't compress debug information"
help
Don't compress debug info sections.
config DEBUG_INFO_COMPRESSED_ZLIB
bool "Compress debugging information with zlib"
Makefile: support compressed debug info As debug information gets larger and larger, it helps significantly save the size of vmlinux images to compress the information in the debug information sections. Note: this debug info is typically split off from the final compressed kernel image, which is why vmlinux is what's used in conjunction with GDB. Minimizing the debug info size should have no impact on boot times, or final compressed kernel image size. All of the debug sections will have a `C` flag set. $ readelf -S <object file> $ bloaty vmlinux.gcc75.compressed.dwarf4 -- \ vmlinux.gcc75.uncompressed.dwarf4 FILE SIZE VM SIZE -------------- -------------- +0.0% +18 [ = ] 0 [Unmapped] -73.3% -114Ki [ = ] 0 .debug_aranges -76.2% -2.01Mi [ = ] 0 .debug_frame -73.6% -2.89Mi [ = ] 0 .debug_str -80.7% -4.66Mi [ = ] 0 .debug_abbrev -82.9% -4.88Mi [ = ] 0 .debug_ranges -70.5% -9.04Mi [ = ] 0 .debug_line -79.3% -10.9Mi [ = ] 0 .debug_loc -39.5% -88.6Mi [ = ] 0 .debug_info -18.2% -123Mi [ = ] 0 TOTAL $ bloaty vmlinux.clang11.compressed.dwarf4 -- \ vmlinux.clang11.uncompressed.dwarf4 FILE SIZE VM SIZE -------------- -------------- +0.0% +23 [ = ] 0 [Unmapped] -65.6% -871 [ = ] 0 .debug_aranges -77.4% -1.84Mi [ = ] 0 .debug_frame -82.9% -2.33Mi [ = ] 0 .debug_abbrev -73.1% -2.43Mi [ = ] 0 .debug_str -84.8% -3.07Mi [ = ] 0 .debug_ranges -65.9% -8.62Mi [ = ] 0 .debug_line -86.2% -40.0Mi [ = ] 0 .debug_loc -42.0% -64.1Mi [ = ] 0 .debug_info -22.1% -122Mi [ = ] 0 TOTAL For x86_64 defconfig + LLVM=1 (before): Elapsed (wall clock) time (h:mm:ss or m:ss): 3:22.03 Maximum resident set size (kbytes): 43856 For x86_64 defconfig + LLVM=1 (after): Elapsed (wall clock) time (h:mm:ss or m:ss): 3:32.52 Maximum resident set size (kbytes): 1566776 Thanks to: Nick Clifton helped us to provide the minimal binutils version. Sedat Dilek found an increase in size of debug .deb package. Cc: Nick Clifton <nickc@redhat.com> Suggested-by: David Blaikie <blaikie@google.com> Reviewed-by: Fangrui Song <maskray@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-27 01:18:29 +08:00
depends on $(cc-option,-gz=zlib)
depends on $(ld-option,--compress-debug-sections=zlib)
help
Compress the debug information using zlib. Requires GCC 5.0+ or Clang
5.0+, binutils 2.26+, and zlib.
Users of dpkg-deb via scripts/package/builddeb may find an increase in
size of their debug .deb packages with this config set, due to the
debug info being compressed with zlib, then the object files being
recompressed with a different compression scheme. But this is still
preferable to setting $KDEB_COMPRESS to "none" which would be even
larger.
Makefile.debug: support for -gz=zstd Make DEBUG_INFO_COMPRESSED a choice; DEBUG_INFO_COMPRESSED_NONE is the default, DEBUG_INFO_COMPRESSED_ZLIB uses zlib, DEBUG_INFO_COMPRESSED_ZSTD uses zstd. This renames the existing KConfig option DEBUG_INFO_COMPRESSED to DEBUG_INFO_COMPRESSED_ZLIB so users upgrading may need to reset the new Kconfigs. Some quick N=1 measurements with du, /usr/bin/time -v, and bloaty: clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_NONE=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.43 488M vmlinux 27.6% 136Mi 0.0% 0 .debug_info 6.1% 30.2Mi 0.0% 0 .debug_str_offsets 3.5% 17.2Mi 0.0% 0 .debug_line 3.3% 16.3Mi 0.0% 0 .debug_loclists 0.9% 4.62Mi 0.0% 0 .debug_str clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_ZLIB=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.35 385M vmlinux 21.8% 85.4Mi 0.0% 0 .debug_info 2.1% 8.26Mi 0.0% 0 .debug_str_offsets 2.1% 8.24Mi 0.0% 0 .debug_loclists 1.9% 7.48Mi 0.0% 0 .debug_line 0.5% 1.94Mi 0.0% 0 .debug_str clang-16, x86_64 defconfig plus CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_COMPRESSED_ZSTD=y: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.69 373M vmlinux 21.4% 81.4Mi 0.0% 0 .debug_info 2.3% 8.85Mi 0.0% 0 .debug_loclists 1.5% 5.71Mi 0.0% 0 .debug_line 0.5% 1.95Mi 0.0% 0 .debug_str_offsets 0.4% 1.62Mi 0.0% 0 .debug_str That's only a 3.11% overall binary size savings over zlib, but at no performance regression. Link: https://maskray.me/blog/2022-09-09-zstd-compressed-debug-sections Link: https://maskray.me/blog/2022-01-23-compressed-debug-sections Suggested-by: Sedat Dilek (DHL Supply Chain) <sedat.dilek@dhl.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-11-11 03:59:05 +08:00
config DEBUG_INFO_COMPRESSED_ZSTD
bool "Compress debugging information with zstd"
depends on $(cc-option,-gz=zstd)
depends on $(ld-option,--compress-debug-sections=zstd)
help
Compress the debug information using zstd. This may provide better
compression than zlib, for about the same time costs, but requires newer
toolchain support. Requires GCC 13.0+ or Clang 16.0+, binutils 2.40+, and
zstd.
endchoice # "Compressed Debug information"
kbuild: Support split debug info v4 This is an alternative approach to lower the overhead of debug info (as we discussed a few days ago) gcc 4.7+ and newer binutils have a new "split debug info" debug info model where the debug info is only written once into central ".dwo" files. This avoids having to copy it around multiple times, from the object files to the final executable. It lowers the disk space requirements. In addition it defaults to compressed debug data. More details here: http://gcc.gnu.org/wiki/DebugFission This patch adds a new option to enable it. It has to be an option, because it'll undoubtedly break everyone's debuginfo packaging scheme. gdb/objdump/etc. all still work, if you have new enough versions. I don't see big compile wins (maybe a second or two faster or so), but the object dirs with debuginfo get significantly smaller. My standard kernel config (slightly bigger than defconfig) shrinks from 2.9G disk space to 1.1G objdir (with non reduced debuginfo). I presume if you are IO limited the compile time difference will be larger. Only problem I've seen so far is that it doesn't play well with older versions of ccache (apparently fixed, see https://bugzilla.samba.org/show_bug.cgi?id=10005) v2: various fixes from Dirk Gouders. Improve commit message slightly. v3: Fix clean rules and improve Kconfig slightly v4: Fix merge error in last version (Sam Ravnborg) Clarify description that it mainly helps disk size. Cc: Dirk Gouders <dirk@gouders.net> Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz>
2014-07-31 02:50:18 +08:00
config DEBUG_INFO_SPLIT
bool "Produce split debuginfo in .dwo files"
depends on $(cc-option,-gsplit-dwarf)
# RISC-V linker relaxation + -gsplit-dwarf has issues with LLVM and GCC
# prior to 12.x:
# https://github.com/llvm/llvm-project/issues/56642
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99090
depends on !RISCV || GCC_VERSION >= 120000
kbuild: Support split debug info v4 This is an alternative approach to lower the overhead of debug info (as we discussed a few days ago) gcc 4.7+ and newer binutils have a new "split debug info" debug info model where the debug info is only written once into central ".dwo" files. This avoids having to copy it around multiple times, from the object files to the final executable. It lowers the disk space requirements. In addition it defaults to compressed debug data. More details here: http://gcc.gnu.org/wiki/DebugFission This patch adds a new option to enable it. It has to be an option, because it'll undoubtedly break everyone's debuginfo packaging scheme. gdb/objdump/etc. all still work, if you have new enough versions. I don't see big compile wins (maybe a second or two faster or so), but the object dirs with debuginfo get significantly smaller. My standard kernel config (slightly bigger than defconfig) shrinks from 2.9G disk space to 1.1G objdir (with non reduced debuginfo). I presume if you are IO limited the compile time difference will be larger. Only problem I've seen so far is that it doesn't play well with older versions of ccache (apparently fixed, see https://bugzilla.samba.org/show_bug.cgi?id=10005) v2: various fixes from Dirk Gouders. Improve commit message slightly. v3: Fix clean rules and improve Kconfig slightly v4: Fix merge error in last version (Sam Ravnborg) Clarify description that it mainly helps disk size. Cc: Dirk Gouders <dirk@gouders.net> Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz>
2014-07-31 02:50:18 +08:00
help
Generate debug info into separate .dwo files. This significantly
reduces the build directory size for builds with DEBUG_INFO,
because it stores the information only once on disk in .dwo
files instead of multiple times in object files and executables.
In addition the debug information is also compressed.
Requires recent gcc (4.7+) and recent gdb/binutils.
Any tool that packages or reads debug information would need
to know about the .dwo files and include them.
Incompatible with older versions of ccache.
kbuild: add ability to generate BTF type info for vmlinux This patch adds new config option to trigger generation of BTF type information from DWARF debuginfo for vmlinux and kernel modules through pahole, which in turn relies on libbpf for btf_dedup() algorithm. The intent is to record compact type information of all types used inside kernel, including all the structs/unions/typedefs/etc. This enables BPF's compile-once-run-everywhere ([0]) approach, in which tracing programs that are inspecting kernel's internal data (e.g., struct task_struct) can be compiled on a system running some kernel version, but would be possible to run on other kernel versions (and configurations) without recompilation, even if the layout of structs changed and/or some of the fields were added, removed, or renamed. This is only possible if BPF loader can get kernel type info to adjust all the offsets correctly. This patch is a first time in this direction, making sure that BTF type info is part of Linux kernel image in non-loadable ELF section. BTF deduplication ([1]) algorithm typically provides 100x savings compared to DWARF data, so resulting .BTF section is not big as is typically about 2MB in size. [0] http://vger.kernel.org/lpc-bpf2018.html#session-2 [1] https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03 00:49:50 +08:00
config DEBUG_INFO_BTF
bool "Generate BTF type information"
depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
depends on BPF_SYSCALL
Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ...
2022-03-25 05:14:07 +08:00
depends on !DEBUG_INFO_DWARF5 || PAHOLE_VERSION >= 121
# pahole uses elfutils, which does not have support for Hexagon relocations
depends on !HEXAGON
kbuild: add ability to generate BTF type info for vmlinux This patch adds new config option to trigger generation of BTF type information from DWARF debuginfo for vmlinux and kernel modules through pahole, which in turn relies on libbpf for btf_dedup() algorithm. The intent is to record compact type information of all types used inside kernel, including all the structs/unions/typedefs/etc. This enables BPF's compile-once-run-everywhere ([0]) approach, in which tracing programs that are inspecting kernel's internal data (e.g., struct task_struct) can be compiled on a system running some kernel version, but would be possible to run on other kernel versions (and configurations) without recompilation, even if the layout of structs changed and/or some of the fields were added, removed, or renamed. This is only possible if BPF loader can get kernel type info to adjust all the offsets correctly. This patch is a first time in this direction, making sure that BTF type info is part of Linux kernel image in non-loadable ELF section. BTF deduplication ([1]) algorithm typically provides 100x savings compared to DWARF data, so resulting .BTF section is not big as is typically about 2MB in size. [0] http://vger.kernel.org/lpc-bpf2018.html#session-2 [1] https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03 00:49:50 +08:00
help
Generate deduplicated BTF type information from DWARF debug info.
Turning this on expects presence of pahole tool, which will convert
DWARF type info into equivalent deduplicated BTF type info.
kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it Detect if pahole supports split BTF generation, and generate BTF for each selected kernel module, if it does. This is exposed to Makefiles and C code as CONFIG_DEBUG_INFO_BTF_MODULES flag. Kernel module BTF has to be re-generated if either vmlinux's BTF changes or module's .ko changes. To achieve that, I needed a helper similar to if_changed, but that would allow to filter out vmlinux from the list of updated dependencies for .ko building. I've put it next to the only place that uses and needs it, but it might be a better idea to just add it along the other if_changed variants into scripts/Kbuild.include. Each kernel module's BTF deduplication is pretty fast, as it does only incremental BTF deduplication on top of already deduplicated vmlinux BTF. To show the added build time, I've first ran make only just built kernel (to establish the baseline) and then forced only BTF re-generation, without regenerating .ko files. The build was performed with -j60 parallelization on 56-core machine. The final time also includes bzImage building, so it's not a pure BTF overhead. $ time make -j60 ... make -j60 27.65s user 10.96s system 782% cpu 4.933 total $ touch ~/linux-build/default/vmlinux && time make -j60 ... make -j60 123.69s user 27.85s system 1566% cpu 9.675 total So 4.6 seconds real time, with noticeable part spent in compressed vmlinux and bzImage building. To show size savings, I've built my kernel configuration with about 700 kernel modules with full BTF per each kernel module (without deduplicating against vmlinux) and with split BTF against deduplicated vmlinux (approach in this patch). Below are top 10 modules with biggest BTF sizes. And total size of BTF data across all kernel modules. It shows that split BTF "compresses" 115MB down to 5MB total. And the biggest kernel modules get a downsize from 500-570KB down to 200-300KB. FULL BTF ======== $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 115710691 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 570570 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240 ./drivers/gpu/drm/radeon/radeon.ko 503849 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777 ./fs/xfs/xfs.ko 411544 ./drivers/net/ethernet/intel/i40e/i40e.ko 403904 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754 ./drivers/infiniband/core/ib_core.ko 397224 ./fs/cifs/cifs.ko 386249 ./fs/nfsd/nfsd.ko 379738 SPLIT BTF ========= $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 5194047 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 293206 ./drivers/gpu/drm/radeon/radeon.ko 282103 ./fs/xfs/xfs.ko 222150 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444 ./fs/cifs/cifs.ko 109379 ./arch/x86/kvm/kvm.ko 100225 ./drivers/gpu/drm/drm.ko 94827 ./drivers/infiniband/core/ib_core.ko 91188 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
2020-11-10 09:19:30 +08:00
config PAHOLE_HAS_SPLIT_BTF
def_bool PAHOLE_VERSION >= 119
kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it Detect if pahole supports split BTF generation, and generate BTF for each selected kernel module, if it does. This is exposed to Makefiles and C code as CONFIG_DEBUG_INFO_BTF_MODULES flag. Kernel module BTF has to be re-generated if either vmlinux's BTF changes or module's .ko changes. To achieve that, I needed a helper similar to if_changed, but that would allow to filter out vmlinux from the list of updated dependencies for .ko building. I've put it next to the only place that uses and needs it, but it might be a better idea to just add it along the other if_changed variants into scripts/Kbuild.include. Each kernel module's BTF deduplication is pretty fast, as it does only incremental BTF deduplication on top of already deduplicated vmlinux BTF. To show the added build time, I've first ran make only just built kernel (to establish the baseline) and then forced only BTF re-generation, without regenerating .ko files. The build was performed with -j60 parallelization on 56-core machine. The final time also includes bzImage building, so it's not a pure BTF overhead. $ time make -j60 ... make -j60 27.65s user 10.96s system 782% cpu 4.933 total $ touch ~/linux-build/default/vmlinux && time make -j60 ... make -j60 123.69s user 27.85s system 1566% cpu 9.675 total So 4.6 seconds real time, with noticeable part spent in compressed vmlinux and bzImage building. To show size savings, I've built my kernel configuration with about 700 kernel modules with full BTF per each kernel module (without deduplicating against vmlinux) and with split BTF against deduplicated vmlinux (approach in this patch). Below are top 10 modules with biggest BTF sizes. And total size of BTF data across all kernel modules. It shows that split BTF "compresses" 115MB down to 5MB total. And the biggest kernel modules get a downsize from 500-570KB down to 200-300KB. FULL BTF ======== $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 115710691 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 570570 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240 ./drivers/gpu/drm/radeon/radeon.ko 503849 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777 ./fs/xfs/xfs.ko 411544 ./drivers/net/ethernet/intel/i40e/i40e.ko 403904 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754 ./drivers/infiniband/core/ib_core.ko 397224 ./fs/cifs/cifs.ko 386249 ./fs/nfsd/nfsd.ko 379738 SPLIT BTF ========= $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 5194047 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 293206 ./drivers/gpu/drm/radeon/radeon.ko 282103 ./fs/xfs/xfs.ko 222150 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444 ./fs/cifs/cifs.ko 109379 ./arch/x86/kvm/kvm.ko 100225 ./drivers/gpu/drm/drm.ko 94827 ./drivers/infiniband/core/ib_core.ko 91188 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
2020-11-10 09:19:30 +08:00
compiler_types: define __user as __attribute__((btf_type_tag("user"))) The __user attribute is currently mainly used by sparse for type checking. The attribute indicates whether a memory access is in user memory address space or not. Such information is important during tracing kernel internal functions or data structures as accessing user memory often has different mechanisms compared to accessing kernel memory. For example, the perf-probe needs explicit command line specification to indicate a particular argument or string in user-space memory ([1], [2], [3]). Currently, vmlinux BTF is available in kernel with many distributions. If __user attribute information is available in vmlinux BTF, the explicit user memory access information from users will not be necessary as the kernel can figure it out by itself with vmlinux BTF. Besides the above possible use for perf/probe, another use case is for bpf verifier. Currently, for bpf BPF_PROG_TYPE_TRACING type of bpf programs, users can write direct code like p->m1->m2 and "p" could be a function parameter. Without __user information in BTF, the verifier will assume p->m1 accessing kernel memory and will generate normal loads. Let us say "p" actually tagged with __user in the source code. In such cases, p->m1 is actually accessing user memory and direct load is not right and may produce incorrect result. For such cases, bpf_probe_read_user() will be the correct way to read p->m1. To support encoding __user information in BTF, a new attribute __attribute__((btf_type_tag("<arbitrary_string>"))) is implemented in clang ([4]). For example, if we have #define __user __attribute__((btf_type_tag("user"))) during kernel compilation, the attribute "user" information will be preserved in dwarf. After pahole converting dwarf to BTF, __user information will be available in vmlinux BTF. The following is an example with latest upstream clang (clang14) and pahole 1.23: [$ ~] cat test.c #define __user __attribute__((btf_type_tag("user"))) int foo(int __user *arg) { return *arg; } [$ ~] clang -O2 -g -c test.c [$ ~] pahole -JV test.o ... [1] INT int size=4 nr_bits=32 encoding=SIGNED [2] TYPE_TAG user type_id=1 [3] PTR (anon) type_id=2 [4] FUNC_PROTO (anon) return=1 args=(3 arg) [5] FUNC foo type_id=4 [$ ~] You can see for the function argument "int __user *arg", its type is described as PTR -> TYPE_TAG(user) -> INT The kernel can use this information for bpf verification or other use cases. Current btf_type_tag is only supported in clang (>= clang14) and pahole (>= 1.23). gcc support is also proposed and under development ([5]). [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2 [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2 [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2 [4] https://reviews.llvm.org/D111199 [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/ Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 23:46:00 +08:00
config PAHOLE_HAS_BTF_TAG
def_bool PAHOLE_VERSION >= 123
compiler_types: define __user as __attribute__((btf_type_tag("user"))) The __user attribute is currently mainly used by sparse for type checking. The attribute indicates whether a memory access is in user memory address space or not. Such information is important during tracing kernel internal functions or data structures as accessing user memory often has different mechanisms compared to accessing kernel memory. For example, the perf-probe needs explicit command line specification to indicate a particular argument or string in user-space memory ([1], [2], [3]). Currently, vmlinux BTF is available in kernel with many distributions. If __user attribute information is available in vmlinux BTF, the explicit user memory access information from users will not be necessary as the kernel can figure it out by itself with vmlinux BTF. Besides the above possible use for perf/probe, another use case is for bpf verifier. Currently, for bpf BPF_PROG_TYPE_TRACING type of bpf programs, users can write direct code like p->m1->m2 and "p" could be a function parameter. Without __user information in BTF, the verifier will assume p->m1 accessing kernel memory and will generate normal loads. Let us say "p" actually tagged with __user in the source code. In such cases, p->m1 is actually accessing user memory and direct load is not right and may produce incorrect result. For such cases, bpf_probe_read_user() will be the correct way to read p->m1. To support encoding __user information in BTF, a new attribute __attribute__((btf_type_tag("<arbitrary_string>"))) is implemented in clang ([4]). For example, if we have #define __user __attribute__((btf_type_tag("user"))) during kernel compilation, the attribute "user" information will be preserved in dwarf. After pahole converting dwarf to BTF, __user information will be available in vmlinux BTF. The following is an example with latest upstream clang (clang14) and pahole 1.23: [$ ~] cat test.c #define __user __attribute__((btf_type_tag("user"))) int foo(int __user *arg) { return *arg; } [$ ~] clang -O2 -g -c test.c [$ ~] pahole -JV test.o ... [1] INT int size=4 nr_bits=32 encoding=SIGNED [2] TYPE_TAG user type_id=1 [3] PTR (anon) type_id=2 [4] FUNC_PROTO (anon) return=1 args=(3 arg) [5] FUNC foo type_id=4 [$ ~] You can see for the function argument "int __user *arg", its type is described as PTR -> TYPE_TAG(user) -> INT The kernel can use this information for bpf verification or other use cases. Current btf_type_tag is only supported in clang (>= clang14) and pahole (>= 1.23). gcc support is also proposed and under development ([5]). [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2 [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2 [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2 [4] https://reviews.llvm.org/D111199 [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/ Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27 23:46:00 +08:00
depends on CC_IS_CLANG
help
Decide whether pahole emits btf_tag attributes (btf_type_tag and
btf_decl_tag) or not. Currently only clang compiler implements
these attributes, so make the config depend on CC_IS_CLANG.
kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it Detect if pahole supports split BTF generation, and generate BTF for each selected kernel module, if it does. This is exposed to Makefiles and C code as CONFIG_DEBUG_INFO_BTF_MODULES flag. Kernel module BTF has to be re-generated if either vmlinux's BTF changes or module's .ko changes. To achieve that, I needed a helper similar to if_changed, but that would allow to filter out vmlinux from the list of updated dependencies for .ko building. I've put it next to the only place that uses and needs it, but it might be a better idea to just add it along the other if_changed variants into scripts/Kbuild.include. Each kernel module's BTF deduplication is pretty fast, as it does only incremental BTF deduplication on top of already deduplicated vmlinux BTF. To show the added build time, I've first ran make only just built kernel (to establish the baseline) and then forced only BTF re-generation, without regenerating .ko files. The build was performed with -j60 parallelization on 56-core machine. The final time also includes bzImage building, so it's not a pure BTF overhead. $ time make -j60 ... make -j60 27.65s user 10.96s system 782% cpu 4.933 total $ touch ~/linux-build/default/vmlinux && time make -j60 ... make -j60 123.69s user 27.85s system 1566% cpu 9.675 total So 4.6 seconds real time, with noticeable part spent in compressed vmlinux and bzImage building. To show size savings, I've built my kernel configuration with about 700 kernel modules with full BTF per each kernel module (without deduplicating against vmlinux) and with split BTF against deduplicated vmlinux (approach in this patch). Below are top 10 modules with biggest BTF sizes. And total size of BTF data across all kernel modules. It shows that split BTF "compresses" 115MB down to 5MB total. And the biggest kernel modules get a downsize from 500-570KB down to 200-300KB. FULL BTF ======== $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 115710691 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 570570 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240 ./drivers/gpu/drm/radeon/radeon.ko 503849 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777 ./fs/xfs/xfs.ko 411544 ./drivers/net/ethernet/intel/i40e/i40e.ko 403904 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754 ./drivers/infiniband/core/ib_core.ko 397224 ./fs/cifs/cifs.ko 386249 ./fs/nfsd/nfsd.ko 379738 SPLIT BTF ========= $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 5194047 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 293206 ./drivers/gpu/drm/radeon/radeon.ko 282103 ./fs/xfs/xfs.ko 222150 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444 ./fs/cifs/cifs.ko 109379 ./arch/x86/kvm/kvm.ko 100225 ./drivers/gpu/drm/drm.ko 94827 ./drivers/infiniband/core/ib_core.ko 91188 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
2020-11-10 09:19:30 +08:00
config PAHOLE_HAS_LANG_EXCLUDE
def_bool PAHOLE_VERSION >= 124
help
Support for the --lang_exclude flag which makes pahole exclude
compilation units from the supplied language. Used in Kbuild to
omit Rust CUs which are not supported in version 1.24 of pahole,
otherwise it would emit malformed kernel and module binaries when
using DEBUG_INFO_BTF_MODULES.
kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it Detect if pahole supports split BTF generation, and generate BTF for each selected kernel module, if it does. This is exposed to Makefiles and C code as CONFIG_DEBUG_INFO_BTF_MODULES flag. Kernel module BTF has to be re-generated if either vmlinux's BTF changes or module's .ko changes. To achieve that, I needed a helper similar to if_changed, but that would allow to filter out vmlinux from the list of updated dependencies for .ko building. I've put it next to the only place that uses and needs it, but it might be a better idea to just add it along the other if_changed variants into scripts/Kbuild.include. Each kernel module's BTF deduplication is pretty fast, as it does only incremental BTF deduplication on top of already deduplicated vmlinux BTF. To show the added build time, I've first ran make only just built kernel (to establish the baseline) and then forced only BTF re-generation, without regenerating .ko files. The build was performed with -j60 parallelization on 56-core machine. The final time also includes bzImage building, so it's not a pure BTF overhead. $ time make -j60 ... make -j60 27.65s user 10.96s system 782% cpu 4.933 total $ touch ~/linux-build/default/vmlinux && time make -j60 ... make -j60 123.69s user 27.85s system 1566% cpu 9.675 total So 4.6 seconds real time, with noticeable part spent in compressed vmlinux and bzImage building. To show size savings, I've built my kernel configuration with about 700 kernel modules with full BTF per each kernel module (without deduplicating against vmlinux) and with split BTF against deduplicated vmlinux (approach in this patch). Below are top 10 modules with biggest BTF sizes. And total size of BTF data across all kernel modules. It shows that split BTF "compresses" 115MB down to 5MB total. And the biggest kernel modules get a downsize from 500-570KB down to 200-300KB. FULL BTF ======== $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 115710691 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 570570 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240 ./drivers/gpu/drm/radeon/radeon.ko 503849 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777 ./fs/xfs/xfs.ko 411544 ./drivers/net/ethernet/intel/i40e/i40e.ko 403904 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754 ./drivers/infiniband/core/ib_core.ko 397224 ./fs/cifs/cifs.ko 386249 ./fs/nfsd/nfsd.ko 379738 SPLIT BTF ========= $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 5194047 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 293206 ./drivers/gpu/drm/radeon/radeon.ko 282103 ./fs/xfs/xfs.ko 222150 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444 ./fs/cifs/cifs.ko 109379 ./arch/x86/kvm/kvm.ko 100225 ./drivers/gpu/drm/drm.ko 94827 ./drivers/infiniband/core/ib_core.ko 91188 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
2020-11-10 09:19:30 +08:00
config DEBUG_INFO_BTF_MODULES
bool "Generate BTF type information for kernel modules"
default y
kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it Detect if pahole supports split BTF generation, and generate BTF for each selected kernel module, if it does. This is exposed to Makefiles and C code as CONFIG_DEBUG_INFO_BTF_MODULES flag. Kernel module BTF has to be re-generated if either vmlinux's BTF changes or module's .ko changes. To achieve that, I needed a helper similar to if_changed, but that would allow to filter out vmlinux from the list of updated dependencies for .ko building. I've put it next to the only place that uses and needs it, but it might be a better idea to just add it along the other if_changed variants into scripts/Kbuild.include. Each kernel module's BTF deduplication is pretty fast, as it does only incremental BTF deduplication on top of already deduplicated vmlinux BTF. To show the added build time, I've first ran make only just built kernel (to establish the baseline) and then forced only BTF re-generation, without regenerating .ko files. The build was performed with -j60 parallelization on 56-core machine. The final time also includes bzImage building, so it's not a pure BTF overhead. $ time make -j60 ... make -j60 27.65s user 10.96s system 782% cpu 4.933 total $ touch ~/linux-build/default/vmlinux && time make -j60 ... make -j60 123.69s user 27.85s system 1566% cpu 9.675 total So 4.6 seconds real time, with noticeable part spent in compressed vmlinux and bzImage building. To show size savings, I've built my kernel configuration with about 700 kernel modules with full BTF per each kernel module (without deduplicating against vmlinux) and with split BTF against deduplicated vmlinux (approach in this patch). Below are top 10 modules with biggest BTF sizes. And total size of BTF data across all kernel modules. It shows that split BTF "compresses" 115MB down to 5MB total. And the biggest kernel modules get a downsize from 500-570KB down to 200-300KB. FULL BTF ======== $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 115710691 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 570570 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240 ./drivers/gpu/drm/radeon/radeon.ko 503849 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777 ./fs/xfs/xfs.ko 411544 ./drivers/net/ethernet/intel/i40e/i40e.ko 403904 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754 ./drivers/infiniband/core/ib_core.ko 397224 ./fs/cifs/cifs.ko 386249 ./fs/nfsd/nfsd.ko 379738 SPLIT BTF ========= $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }' 5194047 $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10 ./drivers/gpu/drm/i915/i915.ko 293206 ./drivers/gpu/drm/radeon/radeon.ko 282103 ./fs/xfs/xfs.ko 222150 ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503 ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356 ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444 ./fs/cifs/cifs.ko 109379 ./arch/x86/kvm/kvm.ko 100225 ./drivers/gpu/drm/drm.ko 94827 ./drivers/infiniband/core/ib_core.ko 91188 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
2020-11-10 09:19:30 +08:00
depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF
help
Generate compact split BTF type information for kernel modules.
config MODULE_ALLOW_BTF_MISMATCH
bool "Allow loading modules with non-matching BTF type info"
depends on DEBUG_INFO_BTF_MODULES
help
For modules whose split BTF does not match vmlinux, load without
BTF rather than refusing to load. The default behavior with
module BTF enabled is to reject modules with such mismatches;
this option will still load module BTF where possible but ignore
it when a mismatch is found.
config GDB_SCRIPTS
bool "Provide GDB scripts for kernel debugging"
help
This creates the required links to GDB helper scripts in the
build directory. If you load vmlinux into gdb, the helper
scripts will be automatically imported by gdb as well, and
additional functions are available to analyze a Linux kernel
instance. See Documentation/dev-tools/gdb-kernel-debugging.rst
for further details.
endif # DEBUG_INFO
config FRAME_WARN
int "Warn for stack frames larger than"
range 0 8192
default 0 if KMSAN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
Tell the compiler to warn at build time for stack frames larger than this.
Setting this too low will cause a lot of warnings.
Setting it to 0 disables the warning.
config STRIP_ASM_SYMS
bool "Strip assembler-generated symbols during link"
default n
help
Strip internal assembler-generated symbols during a link (symbols
that look like '.Lxxx') so they don't pollute the output of
get_wchan() and suchlike.
config READABLE_ASM
bool "Generate readable assembler code"
depends on DEBUG_KERNEL
Makefile: remove stale cc-option checks cc-option, cc-option-yn, and cc-disable-warning all invoke the compiler during build time, and can slow down the build when these checks become stale for our supported compilers, whose minimally supported versions increases over time. See Documentation/process/changes.rst for the current supported minimal versions (GCC 4.9+, clang 10.0.1+). Compiler version support for these flags may be verified on godbolt.org. The following flags are GCC only and supported since at least GCC 4.9. Remove cc-option and cc-disable-warning tests. * -fno-tree-loop-im * -Wno-maybe-uninitialized * -fno-reorder-blocks * -fno-ipa-cp-clone * -fno-partial-inlining * -femit-struct-debug-baseonly * -fno-inline-functions-called-once * -fconserve-stack The following flags are supported by all supported versions of GCC and Clang. Remove their cc-option, cc-option-yn, and cc-disable-warning tests. * -fno-delete-null-pointer-checks * -fno-var-tracking * -Wno-array-bounds The following configs are made dependent on GCC, since they use GCC specific flags. * READABLE_ASM * DEBUG_SECTION_MISMATCH -mfentry was not supported by s390-linux-gnu-gcc until gcc-9+, add a comment. --param=allow-store-data-races=0 was renamed to -fno-allow-store-data-races in the GCC 10 release; add a comment. -Wmaybe-uninitialized (GCC specific) was being added for CONFIG_GCOV, then again unconditionally; add it only once. Also, base RETPOLINE_CFLAGS and RETPOLINE_VDSO_CFLAGS on CONFIC_CC_IS_* then remove cc-option tests for Clang. Link: https://github.com/ClangBuiltLinux/linux/issues/1436 Acked-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-08-17 04:25:01 +08:00
depends on CC_IS_GCC
help
Disable some compiler optimizations that tend to generate human unreadable
assembler output. This may make the kernel slightly slower, but it helps
to keep kernel developers who have to stare a lot at assembler listings
sane.
config HEADERS_INSTALL
bool "Install uapi headers to usr/include"
depends on !UML
help
This option will install uapi headers (headers exported to user-space)
into the usr/include directory for use during the kernel build.
This is unneeded for building the kernel itself, but needed for some
user-space program samples. It is also needed by some features such
as uapi header sanity checks.
config DEBUG_SECTION_MISMATCH
bool "Enable full Section mismatch analysis"
Makefile: remove stale cc-option checks cc-option, cc-option-yn, and cc-disable-warning all invoke the compiler during build time, and can slow down the build when these checks become stale for our supported compilers, whose minimally supported versions increases over time. See Documentation/process/changes.rst for the current supported minimal versions (GCC 4.9+, clang 10.0.1+). Compiler version support for these flags may be verified on godbolt.org. The following flags are GCC only and supported since at least GCC 4.9. Remove cc-option and cc-disable-warning tests. * -fno-tree-loop-im * -Wno-maybe-uninitialized * -fno-reorder-blocks * -fno-ipa-cp-clone * -fno-partial-inlining * -femit-struct-debug-baseonly * -fno-inline-functions-called-once * -fconserve-stack The following flags are supported by all supported versions of GCC and Clang. Remove their cc-option, cc-option-yn, and cc-disable-warning tests. * -fno-delete-null-pointer-checks * -fno-var-tracking * -Wno-array-bounds The following configs are made dependent on GCC, since they use GCC specific flags. * READABLE_ASM * DEBUG_SECTION_MISMATCH -mfentry was not supported by s390-linux-gnu-gcc until gcc-9+, add a comment. --param=allow-store-data-races=0 was renamed to -fno-allow-store-data-races in the GCC 10 release; add a comment. -Wmaybe-uninitialized (GCC specific) was being added for CONFIG_GCOV, then again unconditionally; add it only once. Also, base RETPOLINE_CFLAGS and RETPOLINE_VDSO_CFLAGS on CONFIC_CC_IS_* then remove cc-option tests for Clang. Link: https://github.com/ClangBuiltLinux/linux/issues/1436 Acked-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-08-17 04:25:01 +08:00
depends on CC_IS_GCC
help
The section mismatch analysis checks if there are illegal
references from one section to another section.
During linktime or runtime, some sections are dropped;
any use of code/data previously in these sections would
most likely result in an oops.
In the code, functions and variables are annotated with
__init,, etc. (see the full list in include/linux/init.h),
which results in the code/data being placed in specific sections.
The section mismatch analysis is always performed after a full
kernel build, and enabling this option causes the following
kbuild: create *.mod with full directory path and remove MODVERDIR While descending directories, Kbuild produces objects for modules, but do not link final *.ko files; it is done in the modpost. To keep track of modules, Kbuild creates a *.mod file in $(MODVERDIR) for every module it is building. Some post-processing steps read the necessary information from *.mod files. This avoids descending into directories again. This mechanism was introduced in 2003 or so. Later, commit 551559e13af1 ("kbuild: implement modules.order") added modules.order. So, we can simply read it out to know all the modules with directory paths. This is easier than parsing the first line of *.mod files. $(MODVERDIR) has a flat directory structure, that is, *.mod files are named only with base names. This is based on the assumption that the module name is unique across the tree. This assumption is really fragile. Stephen Rothwell reported a race condition caused by a module name conflict: https://lkml.org/lkml/2019/5/13/991 In parallel building, two different threads could write to the same $(MODVERDIR)/*.mod simultaneously. Non-unique module names are the source of all kind of troubles, hence commit 3a48a91901c5 ("kbuild: check uniqueness of module names") introduced a new checker script. However, it is still fragile in the build system point of view because this race happens before scripts/modules-check.sh is invoked. If it happens again, the modpost will emit unclear error messages. To fix this issue completely, create *.mod with full directory path so that two threads never attempt to write to the same file. $(MODVERDIR) is no longer needed. Since modules with directory paths are listed in modules.order, Kbuild is still able to find *.mod files without additional descending. I also killed cmd_secanalysis; scripts/mod/sumversion.c computes MD4 hash for modules with MODULE_VERSION(). When CONFIG_DEBUG_SECTION_MISMATCH=y, it occurs not only in the modpost stage, but also during directory descending, where sumversion.c may parse stale *.mod files. It would emit 'No such file or directory' warning when an object consisting a module is renamed, or when a single-obj module is turned into a multi-obj module or vice versa. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Nicolas Pitre <nico@fluxnic.net>
2019-07-17 14:17:57 +08:00
additional step to occur:
- Add the option -fno-inline-functions-called-once to gcc commands.
When inlining a function annotated with __init in a non-init
function, we would lose the section information and thus
the analysis would not catch the illegal reference.
This option tells gcc to inline less (but it does result in
a larger kernel).
config SECTION_MISMATCH_WARN_ONLY
bool "Make section mismatch errors non-fatal"
default y
help
If you say N here, the build process will fail if there are any
section mismatch, instead of just throwing warnings.
If unsure, say Y.
config DEBUG_FORCE_FUNCTION_ALIGN_64B
bool "Force all function address 64B aligned"
depends on EXPERT && (X86_64 || ARM64 || PPC32 || PPC64 || ARC || RISCV || S390)
select FUNCTION_ALIGNMENT_64B
./Makefile: add debug option to enable function aligned on 32 bytes Recently 0day reported many strange performance changes (regression or improvement), in which there was no obvious relation between the culprit commit and the benchmark at the first look, and it causes people to doubt the test itself is wrong. Upon further check, many of these cases are caused by the change to the alignment of kernel text or data, as whole text/data of kernel are linked together, change in one domain may affect alignments of other domains. gcc has an option '-falign-functions=n' to force text aligned, and with that option enabled, some of those performance changes will be gone, like [1][2][3]. Add this option so that developers and 0day can easily find performance bump caused by text alignment change, as tracking these strange bump is quite time consuming. Though it can't help in other cases like data alignment changes like [4]. Following is some size data for v5.7 kernel built with a RHEL config used in 0day: text data bss dec filename 19738771 13292906 5554236 38585913 vmlinux.noalign 19758591 13297002 5529660 38585253 vmlinux.align32 Raw vmlinux size in bytes: v5.7 v5.7+align32 253950832 254018000 +0.02% Some benchmark data, most of them have no big change: * hackbench: [ -1.8%, +0.5%] * fsmark: [ -3.2%, +3.4%] # ext4/xfs/btrfs * kbuild: [ -2.0%, +0.9%] * will-it-scale: [ -0.5%, +1.8%] # mmap1/pagefault3 * netperf: - TCP_CRR [+16.6%, +97.4%] - TCP_RR [-18.5%, -1.8%] - TCP_STREAM [ -1.1%, +1.9%] [1] https://lore.kernel.org/lkml/20200114085637.GA29297@shao2-debian/ [2] https://lore.kernel.org/lkml/20200330011254.GA14393@feng-iot/ [3] https://lore.kernel.org/lkml/1d98d1f0-fe84-6df7-f5bd-f4cb2cdb7f45@intel.com/ [4] https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/ Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andy Shevchenko <andriy.shevchenko@intel.com> Link: http://lkml.kernel.org/r/1595475001-90945-1-git-send-email-feng.tang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:34:13 +08:00
help
There are cases that a commit from one domain changes the function
address alignment of other domains, and cause magic performance
bump (regression or improvement). Enable this option will help to
verify if the bump is caused by function alignment changes, while
it will slightly increase the kernel size and affect icache usage.
It is mainly for debug and performance tuning use.
#
# Select this config option from the architecture Kconfig, if it
# is preferred to always offer frame pointers as a config
# option on the architecture (regardless of KERNEL_DEBUG):
#
config ARCH_WANT_FRAME_POINTERS
bool
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
depends on DEBUG_KERNEL && (M68K || UML || SUPERH) || ARCH_WANT_FRAME_POINTERS
default y if (DEBUG_INFO && UML) || ARCH_WANT_FRAME_POINTERS
help
If you say Y here the resulting kernel image will be slightly
larger and slower, but it gives very useful debugging information
in case of kernel bugs. (precise oopses/stacktraces/warnings)
config OBJTOOL
bool
config STACK_VALIDATION
bool "Compile-time stack metadata validation"
depends on HAVE_STACK_VALIDATION && UNWINDER_FRAME_POINTER
select OBJTOOL
default n
help
Validate frame pointer rules at compile-time. This helps ensure that
runtime stack traces are more reliable.
For more information, see
tools/objtool/Documentation/objtool.txt.
config NOINSTR_VALIDATION
bool
depends on HAVE_NOINSTR_VALIDATION && DEBUG_ENTRY
select OBJTOOL
default y
config VMLINUX_MAP
bool "Generate vmlinux.map file when linking"
depends on EXPERT
help
Selecting this option will pass "-Map=vmlinux.map" to ld
when linking vmlinux. That file can be useful for verifying
and debugging magic section games, and for seeing which
pieces of code get eliminated with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION.
config DEBUG_FORCE_WEAK_PER_CPU
bool "Force weak per-cpu definitions"
depends on DEBUG_KERNEL
help
s390 and alpha require percpu variables in modules to be
defined weak to work around addressing range issue which
puts the following two restrictions on percpu variable
definitions.
1. percpu symbols must be unique whether static or not
2. percpu variables can't be defined inside a function
To ensure that generic code follows the above rules, this
option forces all percpu variables to be defined as weak.
endmenu # "Compiler options"
menu "Generic Kernel Debugging Instruments"
config MAGIC_SYSRQ
bool "Magic SysRq key"
depends on !UML
help
If you say Y here, you will have some control over the system even
if the system crashes for example during kernel debugging (e.g., you
will be able to flush the buffer cache to disk, reboot the system
immediately or dump some status information). This is accomplished
by pressing various keys while holding SysRq (Alt+PrintScreen). It
also works on a serial console (on PC hardware at least), if you
send a BREAK and then within 5 seconds a command keypress. The
keys are documented in <file:Documentation/admin-guide/sysrq.rst>.
Don't say Y unless you really know what this hack does.
config MAGIC_SYSRQ_DEFAULT_ENABLE
hex "Enable magic SysRq key functions by default"
depends on MAGIC_SYSRQ
default 0x1
help
Specifies which SysRq key functions are enabled by default.
This may be set to 1 or 0 to enable or disable them all, or
to a bitmask as described in Documentation/admin-guide/sysrq.rst.
config MAGIC_SYSRQ_SERIAL
bool "Enable magic SysRq key over serial"
depends on MAGIC_SYSRQ
default y
help
Many embedded boards have a disconnected TTL level serial which can
generate some garbage that can lead to spurious false sysrq detects.
This option allows you to decide whether you want to enable the
magic SysRq key.
config MAGIC_SYSRQ_SERIAL_SEQUENCE
string "Char sequence that enables magic SysRq over serial"
depends on MAGIC_SYSRQ_SERIAL
default ""
help
Specifies a sequence of characters that can follow BREAK to enable
SysRq on a serial console.
If unsure, leave an empty string and the option will not be enabled.
config DEBUG_FS
bool "Debug Filesystem"
help
debugfs is a virtual file system that kernel developers use to put
debugging files into. Enable this option to be able to read and
write to these files.
For detailed documentation on the debugfs API, see
Documentation/filesystems/.
If unsure, say N.
choice
prompt "Debugfs default access"
depends on DEBUG_FS
default DEBUG_FS_ALLOW_ALL
help
This selects the default access restrictions for debugfs.
It can be overridden with kernel command line option
debugfs=[on,no-mount,off]. The restrictions apply for API access
and filesystem registration.
config DEBUG_FS_ALLOW_ALL
bool "Access normal"
help
No restrictions apply. Both API and filesystem registration
is on. This is the normal default operation.
config DEBUG_FS_DISALLOW_MOUNT
bool "Do not register debugfs as filesystem"
help
The API is open but filesystem is not loaded. Clients can still do
their work and read with debug tools that do not need
debugfs filesystem.
config DEBUG_FS_ALLOW_NONE
bool "No access"
help
Access is off. Clients get -PERM when trying to create nodes in
debugfs tree and debugfs is not registered as a filesystem.
Client can then back-off or continue without debugfs access.
endchoice
source "lib/Kconfig.kgdb"
source "lib/Kconfig.ubsan"
source "lib/Kconfig.kcsan"
endmenu
menu "Networking Debugging"
source "net/Kconfig.debug"
endmenu # "Networking Debugging"
init: introduce DEBUG_MISC option Patch series "init: Do not select DEBUG_KERNEL by default", v5. CONFIG_DEBUG_KERNEL has been designed to just enable Kconfig options. Kernel code generatoin should not depend on CONFIG_DEBUG_KERNEL. Proposed alternative plan: let's add a new symbol, something like DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". This patch (of 5): Introduce DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". Link: http://lkml.kernel.org/r/20190413224438.10802-2-okaya@kernel.org Signed-off-by: Sinan Kaya <okaya@kernel.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Florian Westphal <fw@strlen.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: James Hogan <jhogan@kernel.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 06:44:00 +08:00
menu "Memory Debugging"
source "mm/Kconfig.debug"
config DEBUG_OBJECTS
bool "Debug object operations"
depends on DEBUG_KERNEL
help
If you say Y here, additional code will be inserted into the
kernel to track the life time of various objects and validate
the operations on those objects.
config DEBUG_OBJECTS_SELFTEST
bool "Debug objects selftest"
depends on DEBUG_OBJECTS
help
This enables the selftest of the object debug code.
config DEBUG_OBJECTS_FREE
bool "Debug objects in freed memory"
depends on DEBUG_OBJECTS
help
This enables checks whether a k/v free operation frees an area
which contains an object which has not been deactivated
properly. This can make kmalloc/kfree-intensive workloads
much slower.
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
config DEBUG_OBJECTS_TIMERS
bool "Debug timer objects"
depends on DEBUG_OBJECTS
help
If you say Y here, additional code will be inserted into the
timer routines to track the life time of timer objects and
validate the timer operations.
config DEBUG_OBJECTS_WORK
bool "Debug work objects"
depends on DEBUG_OBJECTS
help
If you say Y here, additional code will be inserted into the
work queue routines to track the life time of work objects and
validate the work operations.
tree/tiny rcu: Add debug RCU head objects Helps finding racy users of call_rcu(), which results in hangs because list entries are overwritten and/or skipped. Changelog since v4: - Bissectability is now OK - Now generate a WARN_ON_ONCE() for non-initialized rcu_head passed to call_rcu(). Statically initialized objects are detected with object_is_static(). - Rename rcu_head_init_on_stack to init_rcu_head_on_stack. - Remove init_rcu_head() completely. Changelog since v3: - Include comments from Lai Jiangshan This new patch version is based on the debugobjects with the newly introduced "active state" tracker. Non-initialized entries are all considered as "statically initialized". An activation fixup (triggered by call_rcu()) takes care of performing the debug object initialization without issuing any warning. Since we cannot increase the size of struct rcu_head, I don't see much room to put an identifier for statically initialized rcu_head structures. So for now, we have to live without "activation without explicit init" detection. But the main purpose of this debug option is to detect double-activations (double call_rcu() use of a rcu_head before the callback is executed), which is correctly addressed here. This also detects potential internal RCU callback corruption, which would cause the callbacks to be executed twice. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: David S. Miller <davem@davemloft.net> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: akpm@linux-foundation.org CC: mingo@elte.hu CC: laijs@cn.fujitsu.com CC: dipankar@in.ibm.com CC: josh@joshtriplett.org CC: dvhltc@us.ibm.com CC: niv@us.ibm.com CC: tglx@linutronix.de CC: peterz@infradead.org CC: rostedt@goodmis.org CC: Valdis.Kletnieks@vt.edu CC: dhowells@redhat.com CC: eric.dumazet@gmail.com CC: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2010-04-17 20:48:42 +08:00
config DEBUG_OBJECTS_RCU_HEAD
bool "Debug RCU callbacks objects"
depends on DEBUG_OBJECTS
tree/tiny rcu: Add debug RCU head objects Helps finding racy users of call_rcu(), which results in hangs because list entries are overwritten and/or skipped. Changelog since v4: - Bissectability is now OK - Now generate a WARN_ON_ONCE() for non-initialized rcu_head passed to call_rcu(). Statically initialized objects are detected with object_is_static(). - Rename rcu_head_init_on_stack to init_rcu_head_on_stack. - Remove init_rcu_head() completely. Changelog since v3: - Include comments from Lai Jiangshan This new patch version is based on the debugobjects with the newly introduced "active state" tracker. Non-initialized entries are all considered as "statically initialized". An activation fixup (triggered by call_rcu()) takes care of performing the debug object initialization without issuing any warning. Since we cannot increase the size of struct rcu_head, I don't see much room to put an identifier for statically initialized rcu_head structures. So for now, we have to live without "activation without explicit init" detection. But the main purpose of this debug option is to detect double-activations (double call_rcu() use of a rcu_head before the callback is executed), which is correctly addressed here. This also detects potential internal RCU callback corruption, which would cause the callbacks to be executed twice. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: David S. Miller <davem@davemloft.net> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: akpm@linux-foundation.org CC: mingo@elte.hu CC: laijs@cn.fujitsu.com CC: dipankar@in.ibm.com CC: josh@joshtriplett.org CC: dvhltc@us.ibm.com CC: niv@us.ibm.com CC: tglx@linutronix.de CC: peterz@infradead.org CC: rostedt@goodmis.org CC: Valdis.Kletnieks@vt.edu CC: dhowells@redhat.com CC: eric.dumazet@gmail.com CC: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2010-04-17 20:48:42 +08:00
help
Enable this to turn on debugging of RCU list heads (call_rcu() usage).
percpu_counter: add debugobj support All percpu counters are linked to a global list on initialization and removed from it on destruction. The list is walked during CPU up/down. If a percpu counter is freed without being properly destroyed, the system will oops only on the next CPU up/down making it pretty nasty to track down. This patch adds debugobj support for percpu counters so that such problems can be found easily. As percpu counters don't make sense on stack and can't be statically initialized, debugobj support is pretty simple. It's initialized and activated on counter initialization, and deactivatd and destroyed on counter destruction. With this patch applied, the bug fixed by commit 602586a83b719df0fbd94196a1359ed35aeb2df3 (shmem: put_super must percpu_counter_destroy) triggers the following warning on tmpfs unmount and the system won't oops on the next cpu up/down operation. ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:259 debug_print_object+0x5c/0x70() Hardware name: Bochs ODEBUG: free active (active state 0) object type: percpu_counter Modules linked in: Pid: 3999, comm: umount Not tainted 2.6.36-rc2-work+ #5 Call Trace: [<ffffffff81083f7f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff81084076>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813b45cc>] debug_print_object+0x5c/0x70 [<ffffffff813b50e5>] debug_check_no_obj_freed+0x125/0x210 [<ffffffff811577d3>] kfree+0xb3/0x2f0 [<ffffffff81132edd>] shmem_put_super+0x1d/0x30 [<ffffffff81162e96>] generic_shutdown_super+0x56/0xe0 [<ffffffff81162f86>] kill_anon_super+0x16/0x60 [<ffffffff81162ff7>] kill_litter_super+0x27/0x30 [<ffffffff81163295>] deactivate_locked_super+0x45/0x60 [<ffffffff81163cfa>] deactivate_super+0x4a/0x70 [<ffffffff8117d446>] mntput_no_expire+0x86/0xe0 [<ffffffff8117df7f>] sys_umount+0x6f/0x360 [<ffffffff8103f01b>] system_call_fastpath+0x16/0x1b ---[ end trace cce2a341ba3611a7 ]--- Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglxlinutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:23:05 +08:00
config DEBUG_OBJECTS_PERCPU_COUNTER
bool "Debug percpu counter objects"
depends on DEBUG_OBJECTS
help
If you say Y here, additional code will be inserted into the
percpu counter routines to track the life time of percpu counter
objects and validate the percpu counter operations.
config DEBUG_OBJECTS_ENABLE_DEFAULT
int "debug_objects bootup default value (0-1)"
range 0 1
default "1"
depends on DEBUG_OBJECTS
help
Debug objects boot parameter default value
mm: shrinkers: introduce debugfs interface for memory shrinkers This commit introduces the /sys/kernel/debug/shrinker debugfs interface which provides an ability to observe the state of individual kernel memory shrinkers. Because the feature adds some memory overhead (which shouldn't be large unless there is a huge amount of registered shrinkers), it's guarded by a config option (enabled by default). This commit introduces the "count" interface for each shrinker registered in the system. The output is in the following format: <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... ... To reduce the size of output on machines with many thousands cgroups, if the total number of objects on all nodes is 0, the line is omitted. If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed as cgroup inode id. If the shrinker is not numa-aware, 0's are printed for all nodes except the first one. This commit gives debugfs entries simple numeric names, which are not very convenient. The following commit in the series will provide shrinkers with more meaningful names. [akpm@linux-foundation.org: remove WARN_ON_ONCE(), per Roman] Reported-by: syzbot+300d27c79fe6d4cbcc39@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/20220601032227.4076670-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-01 11:22:23 +08:00
config SHRINKER_DEBUG
bool "Enable shrinker debugging support"
depends on DEBUG_FS
help
Say Y to enable the shrinker debugfs interface which provides
visibility into the kernel memory shrinkers subsystem.
Disable it to avoid an extra memory footprint.
config DEBUG_STACK_USAGE
bool "Stack utilization instrumentation"
arch: Remove Itanium (IA-64) architecture The Itanium architecture is obsolete, and an informal survey [0] reveals that any residual use of Itanium hardware in production is mostly HP-UX or OpenVMS based. The use of Linux on Itanium appears to be limited to enthusiasts that occasionally boot a fresh Linux kernel to see whether things are still working as intended, and perhaps to churn out some distro packages that are rarely used in practice. None of the original companies behind Itanium still produce or support any hardware or software for the architecture, and it is listed as 'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers that contributed on behalf of those companies (nor anyone else, for that matter) have been willing to support or maintain the architecture upstream or even be responsible for applying the odd fix. The Intel firmware team removed all IA-64 support from the Tianocore/EDK2 reference implementation of EFI in 2018. (Itanium is the original architecture for which EFI was developed, and the way Linux supports it deviates significantly from other architectures.) Some distros, such as Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have dropped support years ago. While the argument is being made [1] that there is a 'for the common good' angle to being able to build and run existing projects such as the Grid Community Toolkit [2] on Itanium for interoperability testing, the fact remains that none of those projects are known to be deployed on Linux/ia64, and very few people actually have access to such a system in the first place. Even if there were ways imaginable in which Linux/ia64 could be put to good use today, what matters is whether anyone is actually doing that, and this does not appear to be the case. There are no emulators widely available, and so boot testing Itanium is generally infeasible for ordinary contributors. GCC still supports IA-64 but its compile farm [3] no longer has any IA-64 machines. GLIBC would like to get rid of IA-64 [4] too because it would permit some overdue code cleanups. In summary, the benefits to the ecosystem of having IA-64 be part of it are mostly theoretical, whereas the maintenance overhead of keeping it supported is real. So let's rip off the band aid, and remove the IA-64 arch code entirely. This follows the timeline proposed by the Debian/ia64 maintainer [5], which removes support in a controlled manner, leaving IA-64 in a known good state in the most recent LTS release. Other projects will follow once the kernel support is removed. [0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/ [1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/ [2] https://gridcf.org/gct-docs/latest/index.html [3] https://cfarm.tetaneutral.net/machines/list/ [4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/ [5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/ Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-10-20 21:54:33 +08:00
depends on DEBUG_KERNEL
help
Enables the display of the minimum amount of free stack which each
task has ever had available in the sysrq-T and sysrq-P debug output.
Also emits a message to dmesg when a process exits if that process
used more stack space than previously exiting processes.
This option will slow down process creation somewhat.
config SCHED_STACK_END_CHECK
bool "Detect stack corruption on calls to schedule()"
depends on DEBUG_KERNEL
default n
help
This option checks for a stack overrun on calls to schedule().
If the stack end location is found to be over written always panic as
the content of the corrupted region can no longer be trusted.
This is to ensure no erroneous behaviour occurs which could result in
data corruption or a sporadic crash at a later stage once the region
is examined. The runtime overhead introduced is minimal.
mm/debug: add tests validating architecture page table helpers This adds tests which will validate architecture page table helpers and other accessors in their compliance with expected generic MM semantics. This will help various architectures in validating changes to existing page table helpers or addition of new ones. This test covers basic page table entry transformations including but not limited to old, young, dirty, clean, write, write protect etc at various level along with populating intermediate entries with next page table page and validating them. Test page table pages are allocated from system memory with required size and alignments. The mapped pfns at page table levels are derived from a real pfn representing a valid kernel text symbol. This test gets called via late_initcall(). This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any architecture, which is willing to subscribe this test will need to select ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390 and powerpc platforms where the test is known to build and run successfully Going forward, other architectures too can subscribe the test after fixing any build or runtime problems with their page table helpers. Folks interested in making sure that a given platform's page table helpers conform to expected generic MM semantics should enable the above config which will just trigger this test during boot. Any non conformity here will be reported as an warning which would need to be fixed. This test will help catch any changes to the agreed upon semantics expected from generic MM and enable platforms to accommodate it thereafter. [anshuman.khandual@arm.com: v17] Link: http://lkml.kernel.org/r/1587436495-22033-3-git-send-email-anshuman.khandual@arm.com [anshuman.khandual@arm.com: v18] Link: http://lkml.kernel.org/r/1588564865-31160-3-git-send-email-anshuman.khandual@arm.com Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> [ppc32] Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/1583919272-24178-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:47:15 +08:00
config ARCH_HAS_DEBUG_VM_PGTABLE
bool
help
An architecture should select this when it can successfully
build and run DEBUG_VM_PGTABLE.
config DEBUG_VM_IRQSOFF
def_bool DEBUG_VM && !PREEMPT_RT
config DEBUG_VM
bool "Debug VM"
depends on DEBUG_KERNEL
help
Enable this to turn on extended checks in the virtual-memory system
that may impact performance.
If unsure, say N.
lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme On big systems, the mm refcount can become highly contented when doing a lot of context switching with threaded applications. user<->idle switch is one of the important cases. Abandoning lazy tlb entirely slows this switching down quite a bit in the common uncontended case, so that is not viable. Implement a scheme where lazy tlb mm references do not contribute to the refcount, instead they get explicitly removed when the refcount reaches zero. The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they switch away from this mm to init_mm if it was being used as the lazy tlb mm. Enabling the shoot lazies option therefore requires that the arch ensures that mm_cpumask contains all CPUs that could possibly be using mm. A DEBUG_VM option IPIs every CPU in the system after this to ensure there are no references remaining before the mm is freed. Shootdown IPIs cost could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. There are a lot of options to reduce them if necessary, described in comments. The near-worst-case can be benchmarked with will-it-scale: context_switch1_threads -t $(($(nproc) / 2)) This will create nproc threads (nproc / 2 switching pairs) all sharing the same mm that spread over all CPUs so each CPU does thread->idle->thread switching. [ Rik came up with basically the same idea a few years ago, so credit to him for that. ] Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/ Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-03 15:18:36 +08:00
config DEBUG_VM_SHOOT_LAZIES
bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation"
depends on DEBUG_VM
depends on MMU_LAZY_TLB_SHOOTDOWN
help
Enable additional IPIs that ensure lazy tlb mm references are removed
before the mm is freed.
If unsure, say N.
Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
config DEBUG_VM_MAPLE_TREE
bool "Debug VM maple trees"
depends on DEBUG_VM
Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
select DEBUG_MAPLE_TREE
help
Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
Enable VM maple tree debugging information and extra validations.
If unsure, say N.
config DEBUG_VM_RB
bool "Debug VM red-black trees"
depends on DEBUG_VM
help
Enable VM red-black tree debugging information and extra validations.
If unsure, say N.
page-flags: introduce page flags policies wrt compound pages This patch adds a third argument to macros which create function definitions for page flags. This argument defines how page-flags helpers behave on compound functions. For now we define four policies: - PF_ANY: the helper function operates on the page it gets, regardless if it's non-compound, head or tail. - PF_HEAD: the helper function operates on the head page of the compound page if it gets tail page. - PF_NO_TAIL: only head and non-compond pages are acceptable for this helper function. - PF_NO_COMPOUND: only non-compound pages are acceptable for this helper function. For now we use policy PF_ANY for all helpers, which matches current behaviour. We do not enforce the policy for TESTPAGEFLAG, because we have flags checked for random pages all over the kernel. Noticeable exception to this is PageTransHuge() which triggers VM_BUG_ON() for tail page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 08:51:21 +08:00
config DEBUG_VM_PGFLAGS
bool "Debug page-flags operations"
depends on DEBUG_VM
help
Enables extra validation on page flags operations.
If unsure, say N.
mm/debug: add tests validating architecture page table helpers This adds tests which will validate architecture page table helpers and other accessors in their compliance with expected generic MM semantics. This will help various architectures in validating changes to existing page table helpers or addition of new ones. This test covers basic page table entry transformations including but not limited to old, young, dirty, clean, write, write protect etc at various level along with populating intermediate entries with next page table page and validating them. Test page table pages are allocated from system memory with required size and alignments. The mapped pfns at page table levels are derived from a real pfn representing a valid kernel text symbol. This test gets called via late_initcall(). This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any architecture, which is willing to subscribe this test will need to select ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390 and powerpc platforms where the test is known to build and run successfully Going forward, other architectures too can subscribe the test after fixing any build or runtime problems with their page table helpers. Folks interested in making sure that a given platform's page table helpers conform to expected generic MM semantics should enable the above config which will just trigger this test during boot. Any non conformity here will be reported as an warning which would need to be fixed. This test will help catch any changes to the agreed upon semantics expected from generic MM and enable platforms to accommodate it thereafter. [anshuman.khandual@arm.com: v17] Link: http://lkml.kernel.org/r/1587436495-22033-3-git-send-email-anshuman.khandual@arm.com [anshuman.khandual@arm.com: v18] Link: http://lkml.kernel.org/r/1588564865-31160-3-git-send-email-anshuman.khandual@arm.com Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> [ppc32] Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/1583919272-24178-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:47:15 +08:00
config DEBUG_VM_PGTABLE
bool "Debug arch page table for semantics compliance"
depends on MMU
depends on ARCH_HAS_DEBUG_VM_PGTABLE
default y if DEBUG_VM
help
This option provides a debug method which can be used to test
architecture page table helper functions on various platforms in
verifying if they comply with expected generic MM semantics. This
will help architecture code in making sure that any changes or
new additions of these helpers still conform to expected
semantics of the generic MM. Platforms will have to opt in for
this through ARCH_HAS_DEBUG_VM_PGTABLE.
If unsure, say N.
config ARCH_HAS_DEBUG_VIRTUAL
bool
config DEBUG_VIRTUAL
bool "Debug VM translations"
depends on DEBUG_KERNEL && ARCH_HAS_DEBUG_VIRTUAL
help
Enable some costly sanity checks in virtual to page code. This can
catch mistakes with virt_to_page() and friends.
If unsure, say N.
config DEBUG_NOMMU_REGIONS
bool "Debug the global anon/private NOMMU mapping region tree"
depends on DEBUG_KERNEL && !MMU
help
This option causes the global tree of anonymous and private mapping
regions to be regularly checked for invalid topology.
config DEBUG_MEMORY_INIT
bool "Debug memory initialisation" if EXPERT
default !EXPERT
help
Enable this for additional checks during memory initialisation.
The sanity checks verify aspects of the VM such as the memory model
and other information provided by the architecture. Verbose
information will be printed at KERN_DEBUG loglevel depending
on the mminit_loglevel= command-line option.
If unsure, say Y
config MEMORY_NOTIFIER_ERROR_INJECT
tristate "Memory hotplug notifier error injection module"
depends on MEMORY_HOTPLUG && NOTIFIER_ERROR_INJECTION
help
This option provides the ability to inject artificial errors to
memory hotplug notifier chain callbacks. It is controlled through
debugfs interface under /sys/kernel/debug/notifier-error-inject/memory
If the notifier call chain should be failed with some events
notified, write the error code to "actions/<notifier event>/error".
Example: Inject memory hotplug offline error (-12 == -ENOMEM)
# cd /sys/kernel/debug/notifier-error-inject/memory
# echo -12 > actions/MEM_GOING_OFFLINE/error
# echo offline > /sys/devices/system/memory/memoryXXX/state
bash: echo: write error: Cannot allocate memory
To compile this code as a module, choose M here: the module will
be called memory-notifier-error-inject.
If unsure, say N.
config DEBUG_PER_CPU_MAPS
bool "Debug access to per_cpu maps"
depends on DEBUG_KERNEL
depends on SMP
help
Say Y to verify that the per_cpu map being accessed has
been set up. This adds a fair amount of code to kernel memory
and decreases performance.
Say N if unsure.
config DEBUG_KMAP_LOCAL
bool "Debug kmap_local temporary mappings"
depends on DEBUG_KERNEL && KMAP_LOCAL
help
This option enables additional error checking for the kmap_local
infrastructure. Disable for production use.
config ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP
bool
config DEBUG_KMAP_LOCAL_FORCE_MAP
bool "Enforce kmap_local temporary mappings"
depends on DEBUG_KERNEL && ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP
select KMAP_LOCAL
select DEBUG_KMAP_LOCAL
help
This option enforces temporary mappings through the kmap_local
mechanism for non-highmem pages and on non-highmem systems.
Disable this for production systems!
config DEBUG_HIGHMEM
bool "Highmem debugging"
depends on DEBUG_KERNEL && HIGHMEM
select DEBUG_KMAP_LOCAL_FORCE_MAP if ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP
select DEBUG_KMAP_LOCAL
help
This option enables additional error checking for high memory
systems. Disable for production systems.
config HAVE_DEBUG_STACKOVERFLOW
bool
config DEBUG_STACKOVERFLOW
bool "Check for stack overflows"
depends on DEBUG_KERNEL && HAVE_DEBUG_STACKOVERFLOW
help
Say Y here if you want to check for overflows of kernel, IRQ
and exception stacks (if your architecture uses them). This
option will show detailed messages if free stack space drops
below a certain limit.
These kinds of bugs usually occur when call-chains in the
kernel get too deep, especially when interrupts are
involved.
Use this in cases where you see apparently random memory
corruption, especially if it appears in 'struct thread_info'
If in doubt, say "N".
config CODE_TAGGING
bool
select KALLSYMS
lib: add allocation tagging support for memory allocation profiling Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-22 00:36:35 +08:00
config MEM_ALLOC_PROFILING
bool "Enable memory allocation profiling"
default n
depends on PROC_FS
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
select PAGE_EXTENSION
select SLAB_OBJ_EXT
lib: add allocation tagging support for memory allocation profiling Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-22 00:36:35 +08:00
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
memory leaks with a low performance and memory impact.
config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
bool "Enable memory allocation profiling by default"
default y
depends on MEM_ALLOC_PROFILING
config MEM_ALLOC_PROFILING_DEBUG
bool "Memory allocation profiler debugging"
default n
depends on MEM_ALLOC_PROFILING
select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
help
Adds warnings with helpful error messages for memory allocation
profiling.
kasan: add kernel address sanitizer infrastructure Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. KASAN uses compile-time instrumentation for checking every memory access, therefore GCC > v4.9.2 required. v4.9.2 almost works, but has issues with putting symbol aliases into the wrong section, which breaks kasan instrumentation of globals. This patch only adds infrastructure for kernel address sanitizer. It's not available for use yet. The idea and some code was borrowed from [1]. Basic idea: The main idea of KASAN is to use shadow memory to record whether each byte of memory is safe to access or not, and use compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address. Here is function to translate address to corresponding shadow address: unsigned long kasan_mem_to_shadow(unsigned long addr) { return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET; } where KASAN_SHADOW_SCALE_SHIFT = 3. So for every 8 bytes there is one corresponding byte of shadow memory. The following encoding used for each shadow byte: 0 means that all 8 bytes of the corresponding memory region are valid for access; k (1 <= k <= 7) means that the first k bytes are valid for access, and other (8 - k) bytes are not; Any negative value indicates that the entire 8-bytes are inaccessible. Different negative values used to distinguish between different kinds of inaccessible memory (redzones, freed memory) (see mm/kasan/kasan.h). To be able to detect accesses to bad memory we need a special compiler. Such compiler inserts a specific function calls (__asan_load*(addr), __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16. These functions check whether memory region is valid to access or not by checking corresponding shadow memory. If access is not valid an error printed. Historical background of the address sanitizer from Dmitry Vyukov: "We've developed the set of tools, AddressSanitizer (Asan), ThreadSanitizer and MemorySanitizer, for user space. We actively use them for testing inside of Google (continuous testing, fuzzing, running prod services). To date the tools have found more than 10'000 scary bugs in Chromium, Google internal codebase and various open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and lots of others): [2] [3] [4]. The tools are part of both gcc and clang compilers. We have not yet done massive testing under the Kernel AddressSanitizer (it's kind of chicken and egg problem, you need it to be upstream to start applying it extensively). To date it has found about 50 bugs. Bugs that we've found in upstream kernel are listed in [5]. We've also found ~20 bugs in out internal version of the kernel. Also people from Samsung and Oracle have found some. [...] As others noted, the main feature of AddressSanitizer is its performance due to inline compiler instrumentation and simple linear shadow memory. User-space Asan has ~2x slowdown on computational programs and ~2x memory consumption increase. Taking into account that kernel usually consumes only small fraction of CPU and memory when running real user-space programs, I would expect that kernel Asan will have ~10-30% slowdown and similar memory consumption increase (when we finish all tuning). I agree that Asan can well replace kmemcheck. We have plans to start working on Kernel MemorySanitizer that finds uses of unitialized memory. Asan+Msan will provide feature-parity with kmemcheck. As others noted, Asan will unlikely replace debug slab and pagealloc that can be enabled at runtime. Asan uses compiler instrumentation, so even if it is disabled, it still incurs visible overheads. Asan technology is easily portable to other architectures. Compiler instrumentation is fully portable. Runtime has some arch-dependent parts like shadow mapping and atomic operation interception. They are relatively easy to port." Comparison with other debugging features: ======================================== KMEMCHECK: - KASan can do almost everything that kmemcheck can. KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck. The only advantage of kmemcheck over KASan is detection of uninitialized memory reads. Some brief performance testing showed that kasan could be x500-x600 times faster than kmemcheck: $ netperf -l 30 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec no debug: 87380 16384 16384 30.00 41624.72 kasan inline: 87380 16384 16384 30.00 12870.54 kasan outline: 87380 16384 16384 30.00 10586.39 kmemcheck: 87380 16384 16384 30.03 20.23 - Also kmemcheck couldn't work on several CPUs. It always sets number of CPUs to 1. KASan doesn't have such limitation. DEBUG_PAGEALLOC: - KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page granularity level, so it able to find more bugs. SLUB_DEBUG (poisoning, redzones): - SLUB_DEBUG has lower overhead than KASan. - SLUB_DEBUG in most cases are not able to detect bad reads, KASan able to detect both reads and writes. - In some cases (e.g. redzone overwritten) SLUB_DEBUG detect bugs only on allocation/freeing of object. KASan catch bugs right before it will happen, so we always know exact place of first bad read/write. [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies Based on work by Andrey Konovalov. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-14 06:39:17 +08:00
source "lib/Kconfig.kasan"
mm: add Kernel Electric-Fence infrastructure Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7. This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. This series enables KFENCE for the x86 and arm64 architectures, and adds KFENCE hooks to the SLAB and SLUB allocators. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool. At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. The KFENCE memory pool is of fixed size, and if the pool is exhausted no further KFENCE allocations occur. The default config is conservative with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB pages). We have verified by running synthetic benchmarks (sysbench I/O, hackbench) and production server-workload benchmarks that a kernel with KFENCE (using sample intervals 100-500ms) is performance-neutral compared to a non-KFENCE baseline kernel. KFENCE is inspired by GWP-ASan [1], a userspace tool with similar properties. The name "KFENCE" is a homage to the Electric Fence Malloc Debugger [2]. For more details, see Documentation/dev-tools/kfence.rst added in the series -- also viewable here: https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst [1] http://llvm.org/docs/GwpAsan.html [2] https://linux.die.net/man/3/efence This patch (of 9): This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. To detect out-of-bounds writes to memory within the object's page itself, KFENCE also uses pattern-based redzones. The following figure illustrates the page layout: ---+-----------+-----------+-----------+-----------+-----------+--- | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx | | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx | | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x | | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx | | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx | | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx | ---+-----------+-----------+-----------+-----------+-----------+--- Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, a guarded allocation from the KFENCE object pool is returned to the main allocator (SLAB or SLUB). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. To date, we have verified by running synthetic benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE is performance-neutral compared to the non-KFENCE baseline. For more details, see Documentation/dev-tools/kfence.rst (added later in the series). [elver@google.com: fix parameter description for kfence_object_start()] Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com [elver@google.com: avoid stalling work queue task without allocations] Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com [elver@google.com: fix potential deadlock due to wake_up()] Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com [elver@google.com: add option to use KFENCE without static keys] Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com [elver@google.com: add missing copyright and description headers] Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: SeongJae Park <sjpark@amazon.de> Co-developed-by: Marco Elver <elver@google.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Joern Engel <joern@purestorage.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:18:53 +08:00
source "lib/Kconfig.kfence"
kmsan: add KMSAN runtime core For each memory location KernelMemorySanitizer maintains two types of metadata: 1. The so-called shadow of that location - а byte:byte mapping describing whether or not individual bits of memory are initialized (shadow is 0) or not (shadow is 1). 2. The origins of that location - а 4-byte:4-byte mapping containing 4-byte IDs of the stack traces where uninitialized values were created. Each struct page now contains pointers to two struct pages holding KMSAN metadata (shadow and origins) for the original struct page. Utility routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata creation, addressing, copying and checking. mm/kmsan/report.c performs error reporting in the cases an uninitialized value is used in a way that leads to undefined behavior. KMSAN compiler instrumentation is responsible for tracking the metadata along with the kernel memory. mm/kmsan/instrumentation.c provides the implementation for instrumentation hooks that are called from files compiled with -fsanitize=kernel-memory. To aid parameter passing (also done at instrumentation level), each task_struct now contains a struct kmsan_task_state used to track the metadata of function parameters and return values for that task. Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The KMSAN_SANITIZE:=n Makefile directive can be used to completely disable KMSAN instrumentation for certain files. Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly created stack memory initialized. Users can also use functions from include/linux/kmsan-checks.h to mark certain memory regions as uninitialized or initialized (this is called "poisoning" and "unpoisoning") or check that a particular region is initialized. Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-15 23:03:45 +08:00
source "lib/Kconfig.kmsan"
kasan: add kernel address sanitizer infrastructure Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. KASAN uses compile-time instrumentation for checking every memory access, therefore GCC > v4.9.2 required. v4.9.2 almost works, but has issues with putting symbol aliases into the wrong section, which breaks kasan instrumentation of globals. This patch only adds infrastructure for kernel address sanitizer. It's not available for use yet. The idea and some code was borrowed from [1]. Basic idea: The main idea of KASAN is to use shadow memory to record whether each byte of memory is safe to access or not, and use compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address. Here is function to translate address to corresponding shadow address: unsigned long kasan_mem_to_shadow(unsigned long addr) { return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET; } where KASAN_SHADOW_SCALE_SHIFT = 3. So for every 8 bytes there is one corresponding byte of shadow memory. The following encoding used for each shadow byte: 0 means that all 8 bytes of the corresponding memory region are valid for access; k (1 <= k <= 7) means that the first k bytes are valid for access, and other (8 - k) bytes are not; Any negative value indicates that the entire 8-bytes are inaccessible. Different negative values used to distinguish between different kinds of inaccessible memory (redzones, freed memory) (see mm/kasan/kasan.h). To be able to detect accesses to bad memory we need a special compiler. Such compiler inserts a specific function calls (__asan_load*(addr), __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16. These functions check whether memory region is valid to access or not by checking corresponding shadow memory. If access is not valid an error printed. Historical background of the address sanitizer from Dmitry Vyukov: "We've developed the set of tools, AddressSanitizer (Asan), ThreadSanitizer and MemorySanitizer, for user space. We actively use them for testing inside of Google (continuous testing, fuzzing, running prod services). To date the tools have found more than 10'000 scary bugs in Chromium, Google internal codebase and various open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and lots of others): [2] [3] [4]. The tools are part of both gcc and clang compilers. We have not yet done massive testing under the Kernel AddressSanitizer (it's kind of chicken and egg problem, you need it to be upstream to start applying it extensively). To date it has found about 50 bugs. Bugs that we've found in upstream kernel are listed in [5]. We've also found ~20 bugs in out internal version of the kernel. Also people from Samsung and Oracle have found some. [...] As others noted, the main feature of AddressSanitizer is its performance due to inline compiler instrumentation and simple linear shadow memory. User-space Asan has ~2x slowdown on computational programs and ~2x memory consumption increase. Taking into account that kernel usually consumes only small fraction of CPU and memory when running real user-space programs, I would expect that kernel Asan will have ~10-30% slowdown and similar memory consumption increase (when we finish all tuning). I agree that Asan can well replace kmemcheck. We have plans to start working on Kernel MemorySanitizer that finds uses of unitialized memory. Asan+Msan will provide feature-parity with kmemcheck. As others noted, Asan will unlikely replace debug slab and pagealloc that can be enabled at runtime. Asan uses compiler instrumentation, so even if it is disabled, it still incurs visible overheads. Asan technology is easily portable to other architectures. Compiler instrumentation is fully portable. Runtime has some arch-dependent parts like shadow mapping and atomic operation interception. They are relatively easy to port." Comparison with other debugging features: ======================================== KMEMCHECK: - KASan can do almost everything that kmemcheck can. KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck. The only advantage of kmemcheck over KASan is detection of uninitialized memory reads. Some brief performance testing showed that kasan could be x500-x600 times faster than kmemcheck: $ netperf -l 30 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec no debug: 87380 16384 16384 30.00 41624.72 kasan inline: 87380 16384 16384 30.00 12870.54 kasan outline: 87380 16384 16384 30.00 10586.39 kmemcheck: 87380 16384 16384 30.03 20.23 - Also kmemcheck couldn't work on several CPUs. It always sets number of CPUs to 1. KASan doesn't have such limitation. DEBUG_PAGEALLOC: - KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page granularity level, so it able to find more bugs. SLUB_DEBUG (poisoning, redzones): - SLUB_DEBUG has lower overhead than KASan. - SLUB_DEBUG in most cases are not able to detect bad reads, KASan able to detect both reads and writes. - In some cases (e.g. redzone overwritten) SLUB_DEBUG detect bugs only on allocation/freeing of object. KASan catch bugs right before it will happen, so we always know exact place of first bad read/write. [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies Based on work by Andrey Konovalov. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-14 06:39:17 +08:00
endmenu # "Memory Debugging"
config DEBUG_SHIRQ
bool "Debug shared IRQ handlers"
depends on DEBUG_KERNEL
help
Enable this to generate a spurious interrupt just before a shared
interrupt handler is deregistered (generating one when registering
is currently disabled). Drivers need to handle this correctly. Some
don't and need to be caught.
menu "Debug Oops, Lockups and Hangs"
config PANIC_ON_OOPS
bool "Panic on Oops"
help
Say Y here to enable the kernel to panic when it oopses. This
has the same effect as setting oops=panic on the kernel command
line.
This feature is useful to ensure that the kernel does not do
anything erroneous after an oops which could result in data
corruption or other issues.
Say N if unsure.
config PANIC_ON_OOPS_VALUE
int
range 0 1
default 0 if !PANIC_ON_OOPS
default 1 if PANIC_ON_OOPS
config PANIC_TIMEOUT
int "panic timeout"
default 0
help
Set the timeout value (in seconds) until a reboot occurs when
the kernel panics. If n = 0, then we wait forever. A timeout
value n > 0 will wait n seconds before rebooting, while a timeout
value n < 0 will reboot immediately. This setting can be overridden
with the kernel command line option panic=, and from userspace via
/proc/sys/kernel/panic.
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
config LOCKUP_DETECTOR
bool
config SOFTLOCKUP_DETECTOR
bool "Detect Soft Lockups"
depends on DEBUG_KERNEL && !S390
select LOCKUP_DETECTOR
help
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
Say Y here to enable the kernel to act as a watchdog to detect
soft lockups.
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
Softlockups are bugs that cause the kernel to loop in kernel
mode for more than 20 seconds, without giving other tasks a
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
chance to run. The current stack trace is displayed upon
detection and the system will stay locked up.
watchdog/softlockup: Low-overhead detection of interrupt storm The following softlockup is caused by interrupt storm, but it cannot be identified from the call tree. Because the call tree is just a snapshot and doesn't fully capture the behavior of the CPU during the soft lockup. watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] ... Call trace: __do_softirq+0xa0/0x37c __irq_exit_rcu+0x108/0x140 irq_exit+0x14/0x20 __handle_domain_irq+0x84/0xe0 gic_handle_irq+0x80/0x108 el0_irq_naked+0x50/0x58 Therefore, it is necessary to report CPU utilization during the softlockup_threshold period (report once every sample_period, for a total of 5 reportings), like this: watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] CPU#28 Utilization every 4s during lockup: #1: 0% system, 0% softirq, 100% hardirq, 0% idle #2: 0% system, 0% softirq, 100% hardirq, 0% idle #3: 0% system, 0% softirq, 100% hardirq, 0% idle #4: 0% system, 0% softirq, 100% hardirq, 0% idle #5: 0% system, 0% softirq, 100% hardirq, 0% idle ... This is helpful in determining whether an interrupt storm has occurred or in identifying the cause of the softlockup. The criteria for determination are as follows: a. If the hardirq utilization is high, then interrupt storm should be considered and the root cause cannot be determined from the call tree. b. If the softirq utilization is high, then the call might not necessarily point at the root cause. c. If the system utilization is high, then analyzing the root cause from the call tree is possible in most cases. The mechanism requires a considerable amount of global storage space when configured for the maximum number of CPUs. Therefore, adding a SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes" if the max number of CPUs is <= 128. Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
2024-04-11 15:41:33 +08:00
config SOFTLOCKUP_DETECTOR_INTR_STORM
bool "Detect Interrupt Storm in Soft Lockups"
depends on SOFTLOCKUP_DETECTOR && IRQ_TIME_ACCOUNTING
select GENERIC_IRQ_STAT_SNAPSHOT
default y if NR_CPUS <= 128
help
Say Y here to enable the kernel to detect interrupt storm
during "soft lockups".
"soft lockups" can be caused by a variety of reasons. If one is
caused by an interrupt storm, then the storming interrupts will not
be on the callstack. To detect this case, it is necessary to report
the CPU stats and the interrupt counts during the "soft lockups".
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on SOFTLOCKUP_DETECTOR
help
Say Y here to enable the kernel to panic on "soft lockups",
which are bugs that cause the kernel to loop in kernel
mode for more than 20 seconds (configurable using the watchdog_thresh
sysctl), without giving other tasks a chance to run.
The panic can be used in combination with panic_timeout,
to cause the system to reboot automatically after a
lockup has been detected. This feature is useful for
high-availability systems that have uptime guarantees and
where a lockup must be resolved ASAP.
Say N if unsure.
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
config HAVE_HARDLOCKUP_DETECTOR_BUDDY
bool
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
depends on SMP
default y
kernel/watchdog: Prevent false positives with turbo modes The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-15 15:50:13 +08:00
#
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
# Global switch whether to build a hardlockup detector at all. It is available
# only when the architecture supports at least one implementation. There are
# two exceptions. The hardlockup detector is never enabled on:
kernel/watchdog: Prevent false positives with turbo modes The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-15 15:50:13 +08:00
#
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
# s390: it reported many false positives there
#
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
# sparc64: has a custom implementation which is not using the common
# hardlockup command line options and sysctl interface.
#
config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
depends on DEBUG_KERNEL && !S390 && !HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific There are several hardlockup detector implementations and several Kconfig values which allow selection and build of the preferred one. CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36. It was a preparation step for introducing the new generic perf hardlockup detector. The existing arch-specific variants did not support the to-be-created generic build configurations, sysctl interface, etc. This distinction was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog: Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR") in v2.6.38. CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105 ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used by three architectures, namely blackfin, mn10300, and sparc. The support for blackfin and mn10300 architectures has been completely dropped some time ago. And sparc is the only architecture with the historic NMI watchdog at the moment. And the old sparc implementation is really special. It is always built on sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added in v4.10-rc1. There are only few locations where the sparc64 NMI watchdog interacts with the generic hardlockup detectors code: + implements arch_touch_nmi_watchdog() which is called from the generic touch_nmi_watchdog() + implements watchdog_hardlockup_enable()/disable() to support /proc/sys/kernel/nmi_watchdog + is always preferred over other generic watchdogs, see CONFIG_HARDLOCKUP_DETECTOR + includes asm/nmi.h into linux/nmi.h because some sparc-specific functions are needed in sparc-specific code which includes only linux/nmi.h. The situation became more complicated after the commit 05a4a95279311c3 ("kernel/watchdog: split up config options") and commit 2104180a53698df5 ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1. They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc specific hardlockup detector. It was compatible with the perf one regarding the general boot, sysctl, and programming interfaces. HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of HAVE_NMI_WATCHDOG. It made some sense because all arch-specific detectors had some common requirements, namely: + implemented arch_touch_nmi_watchdog() + included asm/nmi.h into linux/nmi.h + defined the default value for /proc/sys/kernel/nmi_watchdog But it actually has made things pretty complicated when the generic buddy hardlockup detector was added. Before the generic perf detector was newer supported together with an arch-specific one. But the buddy detector could work on any SMP system. It means that an architecture could support both the arch-specific and buddy detector. As a result, there are few tricky dependencies. For example, CONFIG_HARDLOCKUP_DETECTOR depends on: ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH The problem is that the very special sparc implementation is defined as: HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear without reading understanding the history. Make the logic less tricky and more self-explanatory by making HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64. Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF, and HARDLOCKUP_DETECTOR_BUDDY may conflict only with HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR and it is not longer enabled when HAVE_NMI_WATCHDOG is set. Link: https://lkml.kernel.org/r/20230616150618.6073-5-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:16 +08:00
depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
imply HARDLOCKUP_DETECTOR_ARCH
select LOCKUP_DETECTOR
watchdog/hardlockup: sort hardlockup detector related config values a logical way Patch series "watchdog/hardlockup: Cleanup configuration of hardlockup detectors", v2. Clean up watchdog Kconfig after introducing the buddy detector. This patch (of 6): There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporary breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior. Link: https://lkml.kernel.org/r/20230616150618.6073-1-pmladek@suse.com Link: https://lkml.kernel.org/r/20230616150618.6073-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:13 +08:00
help
Say Y here to enable the kernel to act as a watchdog to detect
hard lockups.
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
Hardlockups are bugs that cause the CPU to loop in kernel mode
for more than 10 seconds, without letting other interrupts have a
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-08 05:11:44 +08:00
chance to run. The current stack trace is displayed upon detection
and the system will stay locked up.
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
#
# Note that arch-specific variants are always preferred.
#
watchdog/hardlockup: sort hardlockup detector related config values a logical way Patch series "watchdog/hardlockup: Cleanup configuration of hardlockup detectors", v2. Clean up watchdog Kconfig after introducing the buddy detector. This patch (of 6): There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporary breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior. Link: https://lkml.kernel.org/r/20230616150618.6073-1-pmladek@suse.com Link: https://lkml.kernel.org/r/20230616150618.6073-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:13 +08:00
config HARDLOCKUP_DETECTOR_PREFER_BUDDY
bool "Prefer the buddy CPU hardlockup detector"
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
depends on HARDLOCKUP_DETECTOR
depends on HAVE_HARDLOCKUP_DETECTOR_PERF && HAVE_HARDLOCKUP_DETECTOR_BUDDY
depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
watchdog/hardlockup: sort hardlockup detector related config values a logical way Patch series "watchdog/hardlockup: Cleanup configuration of hardlockup detectors", v2. Clean up watchdog Kconfig after introducing the buddy detector. This patch (of 6): There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporary breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior. Link: https://lkml.kernel.org/r/20230616150618.6073-1-pmladek@suse.com Link: https://lkml.kernel.org/r/20230616150618.6073-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:13 +08:00
help
Say Y here to prefer the buddy hardlockup detector over the perf one.
With the buddy detector, each CPU uses its softlockup hrtimer
to check that the next CPU is processing hrtimer interrupts by
verifying that a counter is increasing.
This hardlockup detector is useful on systems that don't have
an arch-specific hardlockup detector or if resources needed
for the hardlockup detector are better used for other things.
watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Implement a hardlockup detector that doesn't doesn't need any extra arch-specific support code to detect lockups. Instead of using something arch-specific we will use the buddy system, where each CPU watches out for another one. Specifically, each CPU will use its softlockup hrtimer to check that the next CPU is processing hrtimer interrupts by verifying that a counter is increasing. NOTE: unlike the other hard lockup detectors, the buddy one can't easily show what's happening on the CPU that locked up just by doing a simple backtrace. It relies on some other mechanism in the system to get information about the locked up CPUs. This could be support for NMI backtraces like [1], it could be a mechanism for printing the PC of locked CPUs at panic time like [2] / [3], or it could be something else. Even though that means we still rely on arch-specific code, this arch-specific code seems to often be implemented even on architectures that don't have a hardlockup detector. This style of hardlockup detector originated in some downstream Android trees and has been rebased on / carried in ChromeOS trees for quite a long time for use on arm and arm64 boards. Historically on these boards we've leveraged mechanism [2] / [3] to get information about hung CPUs, but we could move to [1]. Although the original motivation for the buddy system was for use on systems without an arch-specific hardlockup detector, it can still be useful to use even on systems that _do_ have an arch-specific hardlockup detector. On x86, for instance, there is a 24-part patch series [4] in progress switching the arch-specific hard lockup detector from a scarce perf counter to a less-scarce hardware resource. Potentially the buddy system could be a simpler alternative to free up the perf counter but still get hard lockup detection. Overall, pros (+) and cons (-) of the buddy system compared to an arch-specific hardlockup detector (which might be implemented using perf): + The buddy system is usable on systems that don't have an arch-specific hardlockup detector, like arm32 and arm64 (though it's being worked on for arm64 [5]). + The buddy system may free up scarce hardware resources. + If a CPU totally goes out to lunch (can't process NMIs) the buddy system could still detect the problem (though it would be unlikely to be able to get a stack trace). + The buddy system uses the same timer function to pet the hardlockup detector on the running CPU as it uses to detect hardlockups on other CPUs. Compared to other hardlockup detectors, this means it generates fewer interrupts and thus is likely better able to let CPUs stay idle longer. - If all CPUs are hard locked up at the same time the buddy system can't detect it. - If we don't have SMP we can't use the buddy system. - The buddy system needs an arch-specific mechanism (possibly NMI backtrace) to get info about the locked up CPU. [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org [2] https://issuetracker.google.com/172213129 [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/ [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/ Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ian Rogers <irogers@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Sumit Garg <sumit.garg@linaro.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-20 01:18:38 +08:00
config HARDLOCKUP_DETECTOR_PERF
bool
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
depends on HARDLOCKUP_DETECTOR
depends on HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific There are several hardlockup detector implementations and several Kconfig values which allow selection and build of the preferred one. CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36. It was a preparation step for introducing the new generic perf hardlockup detector. The existing arch-specific variants did not support the to-be-created generic build configurations, sysctl interface, etc. This distinction was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog: Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR") in v2.6.38. CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105 ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used by three architectures, namely blackfin, mn10300, and sparc. The support for blackfin and mn10300 architectures has been completely dropped some time ago. And sparc is the only architecture with the historic NMI watchdog at the moment. And the old sparc implementation is really special. It is always built on sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added in v4.10-rc1. There are only few locations where the sparc64 NMI watchdog interacts with the generic hardlockup detectors code: + implements arch_touch_nmi_watchdog() which is called from the generic touch_nmi_watchdog() + implements watchdog_hardlockup_enable()/disable() to support /proc/sys/kernel/nmi_watchdog + is always preferred over other generic watchdogs, see CONFIG_HARDLOCKUP_DETECTOR + includes asm/nmi.h into linux/nmi.h because some sparc-specific functions are needed in sparc-specific code which includes only linux/nmi.h. The situation became more complicated after the commit 05a4a95279311c3 ("kernel/watchdog: split up config options") and commit 2104180a53698df5 ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1. They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc specific hardlockup detector. It was compatible with the perf one regarding the general boot, sysctl, and programming interfaces. HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of HAVE_NMI_WATCHDOG. It made some sense because all arch-specific detectors had some common requirements, namely: + implemented arch_touch_nmi_watchdog() + included asm/nmi.h into linux/nmi.h + defined the default value for /proc/sys/kernel/nmi_watchdog But it actually has made things pretty complicated when the generic buddy hardlockup detector was added. Before the generic perf detector was newer supported together with an arch-specific one. But the buddy detector could work on any SMP system. It means that an architecture could support both the arch-specific and buddy detector. As a result, there are few tricky dependencies. For example, CONFIG_HARDLOCKUP_DETECTOR depends on: ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH The problem is that the very special sparc implementation is defined as: HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear without reading understanding the history. Make the logic less tricky and more self-explanatory by making HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64. Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF, and HARDLOCKUP_DETECTOR_BUDDY may conflict only with HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR and it is not longer enabled when HAVE_NMI_WATCHDOG is set. Link: https://lkml.kernel.org/r/20230616150618.6073-5-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:16 +08:00
depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Implement a hardlockup detector that doesn't doesn't need any extra arch-specific support code to detect lockups. Instead of using something arch-specific we will use the buddy system, where each CPU watches out for another one. Specifically, each CPU will use its softlockup hrtimer to check that the next CPU is processing hrtimer interrupts by verifying that a counter is increasing. NOTE: unlike the other hard lockup detectors, the buddy one can't easily show what's happening on the CPU that locked up just by doing a simple backtrace. It relies on some other mechanism in the system to get information about the locked up CPUs. This could be support for NMI backtraces like [1], it could be a mechanism for printing the PC of locked CPUs at panic time like [2] / [3], or it could be something else. Even though that means we still rely on arch-specific code, this arch-specific code seems to often be implemented even on architectures that don't have a hardlockup detector. This style of hardlockup detector originated in some downstream Android trees and has been rebased on / carried in ChromeOS trees for quite a long time for use on arm and arm64 boards. Historically on these boards we've leveraged mechanism [2] / [3] to get information about hung CPUs, but we could move to [1]. Although the original motivation for the buddy system was for use on systems without an arch-specific hardlockup detector, it can still be useful to use even on systems that _do_ have an arch-specific hardlockup detector. On x86, for instance, there is a 24-part patch series [4] in progress switching the arch-specific hard lockup detector from a scarce perf counter to a less-scarce hardware resource. Potentially the buddy system could be a simpler alternative to free up the perf counter but still get hard lockup detection. Overall, pros (+) and cons (-) of the buddy system compared to an arch-specific hardlockup detector (which might be implemented using perf): + The buddy system is usable on systems that don't have an arch-specific hardlockup detector, like arm32 and arm64 (though it's being worked on for arm64 [5]). + The buddy system may free up scarce hardware resources. + If a CPU totally goes out to lunch (can't process NMIs) the buddy system could still detect the problem (though it would be unlikely to be able to get a stack trace). + The buddy system uses the same timer function to pet the hardlockup detector on the running CPU as it uses to detect hardlockups on other CPUs. Compared to other hardlockup detectors, this means it generates fewer interrupts and thus is likely better able to let CPUs stay idle longer. - If all CPUs are hard locked up at the same time the buddy system can't detect it. - If we don't have SMP we can't use the buddy system. - The buddy system needs an arch-specific mechanism (possibly NMI backtrace) to get info about the locked up CPU. [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org [2] https://issuetracker.google.com/172213129 [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/ [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/ Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ian Rogers <irogers@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Sumit Garg <sumit.garg@linaro.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-20 01:18:38 +08:00
select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
config HARDLOCKUP_DETECTOR_BUDDY
bool
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
depends on HARDLOCKUP_DETECTOR
depends on HAVE_HARDLOCKUP_DETECTOR_BUDDY
depends on !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific There are several hardlockup detector implementations and several Kconfig values which allow selection and build of the preferred one. CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36. It was a preparation step for introducing the new generic perf hardlockup detector. The existing arch-specific variants did not support the to-be-created generic build configurations, sysctl interface, etc. This distinction was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog: Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR") in v2.6.38. CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105 ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used by three architectures, namely blackfin, mn10300, and sparc. The support for blackfin and mn10300 architectures has been completely dropped some time ago. And sparc is the only architecture with the historic NMI watchdog at the moment. And the old sparc implementation is really special. It is always built on sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added in v4.10-rc1. There are only few locations where the sparc64 NMI watchdog interacts with the generic hardlockup detectors code: + implements arch_touch_nmi_watchdog() which is called from the generic touch_nmi_watchdog() + implements watchdog_hardlockup_enable()/disable() to support /proc/sys/kernel/nmi_watchdog + is always preferred over other generic watchdogs, see CONFIG_HARDLOCKUP_DETECTOR + includes asm/nmi.h into linux/nmi.h because some sparc-specific functions are needed in sparc-specific code which includes only linux/nmi.h. The situation became more complicated after the commit 05a4a95279311c3 ("kernel/watchdog: split up config options") and commit 2104180a53698df5 ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1. They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc specific hardlockup detector. It was compatible with the perf one regarding the general boot, sysctl, and programming interfaces. HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of HAVE_NMI_WATCHDOG. It made some sense because all arch-specific detectors had some common requirements, namely: + implemented arch_touch_nmi_watchdog() + included asm/nmi.h into linux/nmi.h + defined the default value for /proc/sys/kernel/nmi_watchdog But it actually has made things pretty complicated when the generic buddy hardlockup detector was added. Before the generic perf detector was newer supported together with an arch-specific one. But the buddy detector could work on any SMP system. It means that an architecture could support both the arch-specific and buddy detector. As a result, there are few tricky dependencies. For example, CONFIG_HARDLOCKUP_DETECTOR depends on: ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH The problem is that the very special sparc implementation is defined as: HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear without reading understanding the history. Make the logic less tricky and more self-explanatory by making HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64. Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF, and HARDLOCKUP_DETECTOR_BUDDY may conflict only with HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR and it is not longer enabled when HAVE_NMI_WATCHDOG is set. Link: https://lkml.kernel.org/r/20230616150618.6073-5-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:16 +08:00
depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Implement a hardlockup detector that doesn't doesn't need any extra arch-specific support code to detect lockups. Instead of using something arch-specific we will use the buddy system, where each CPU watches out for another one. Specifically, each CPU will use its softlockup hrtimer to check that the next CPU is processing hrtimer interrupts by verifying that a counter is increasing. NOTE: unlike the other hard lockup detectors, the buddy one can't easily show what's happening on the CPU that locked up just by doing a simple backtrace. It relies on some other mechanism in the system to get information about the locked up CPUs. This could be support for NMI backtraces like [1], it could be a mechanism for printing the PC of locked CPUs at panic time like [2] / [3], or it could be something else. Even though that means we still rely on arch-specific code, this arch-specific code seems to often be implemented even on architectures that don't have a hardlockup detector. This style of hardlockup detector originated in some downstream Android trees and has been rebased on / carried in ChromeOS trees for quite a long time for use on arm and arm64 boards. Historically on these boards we've leveraged mechanism [2] / [3] to get information about hung CPUs, but we could move to [1]. Although the original motivation for the buddy system was for use on systems without an arch-specific hardlockup detector, it can still be useful to use even on systems that _do_ have an arch-specific hardlockup detector. On x86, for instance, there is a 24-part patch series [4] in progress switching the arch-specific hard lockup detector from a scarce perf counter to a less-scarce hardware resource. Potentially the buddy system could be a simpler alternative to free up the perf counter but still get hard lockup detection. Overall, pros (+) and cons (-) of the buddy system compared to an arch-specific hardlockup detector (which might be implemented using perf): + The buddy system is usable on systems that don't have an arch-specific hardlockup detector, like arm32 and arm64 (though it's being worked on for arm64 [5]). + The buddy system may free up scarce hardware resources. + If a CPU totally goes out to lunch (can't process NMIs) the buddy system could still detect the problem (though it would be unlikely to be able to get a stack trace). + The buddy system uses the same timer function to pet the hardlockup detector on the running CPU as it uses to detect hardlockups on other CPUs. Compared to other hardlockup detectors, this means it generates fewer interrupts and thus is likely better able to let CPUs stay idle longer. - If all CPUs are hard locked up at the same time the buddy system can't detect it. - If we don't have SMP we can't use the buddy system. - The buddy system needs an arch-specific mechanism (possibly NMI backtrace) to get info about the locked up CPU. [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org [2] https://issuetracker.google.com/172213129 [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/ [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/ Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ian Rogers <irogers@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Sumit Garg <sumit.garg@linaro.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-20 01:18:38 +08:00
select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
config HARDLOCKUP_DETECTOR_ARCH
bool
depends on HARDLOCKUP_DETECTOR
depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
help
The arch-specific implementation of the hardlockup detector will
be used.
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
#
watchdog/hardlockup: sort hardlockup detector related config values a logical way Patch series "watchdog/hardlockup: Cleanup configuration of hardlockup detectors", v2. Clean up watchdog Kconfig after introducing the buddy detector. This patch (of 6): There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporary breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior. Link: https://lkml.kernel.org/r/20230616150618.6073-1-pmladek@suse.com Link: https://lkml.kernel.org/r/20230616150618.6073-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:13 +08:00
# Both the "perf" and "buddy" hardlockup detectors count hrtimer
# interrupts. This config enables functions managing this common code.
watchdog/hardlockup: make the config checks more straightforward There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific and sparc64 specific variant. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the conditions between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled. Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Link: https://lkml.kernel.org/r/20230616150618.6073-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:14 +08:00
#
watchdog/hardlockup: sort hardlockup detector related config values a logical way Patch series "watchdog/hardlockup: Cleanup configuration of hardlockup detectors", v2. Clean up watchdog Kconfig after introducing the buddy detector. This patch (of 6): There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporary breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior. Link: https://lkml.kernel.org/r/20230616150618.6073-1-pmladek@suse.com Link: https://lkml.kernel.org/r/20230616150618.6073-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-16 23:06:13 +08:00
config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
bool
select SOFTLOCKUP_DETECTOR
kernel/watchdog: Prevent false positives with turbo modes The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-15 15:50:13 +08:00
#
# Enables a timestamp based low pass filter to compensate for perf based
# hard lockup detection which runs too fast due to turbo modes.
#
config HARDLOCKUP_CHECK_TIMESTAMP
bool
config BOOTPARAM_HARDLOCKUP_PANIC
bool "Panic (Reboot) On Hard Lockups"
depends on HARDLOCKUP_DETECTOR
help
Say Y here to enable the kernel to panic on "hard lockups",
which are bugs that cause the kernel to loop in kernel
mode with interrupts disabled for more than 10 seconds (configurable
using the watchdog_thresh sysctl).
Say N if unsure.
config DETECT_HUNG_TASK
bool "Detect Hung Tasks"
depends on DEBUG_KERNEL
default SOFTLOCKUP_DETECTOR
help
Say Y here to enable the kernel to detect "hung tasks",
which are bugs that cause the task to be stuck in
uninterruptible "D" state indefinitely.
When a hung task is detected, the kernel will print the
current stack trace (which you should report), but the
task will stay in uninterruptible state. If lockdep is
enabled then all held locks will also be reported. This
feature has negligible overhead.
[PATCH] slab: implement /proc/slab_allocators Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 19:06:39 +08:00
config DEFAULT_HUNG_TASK_TIMEOUT
int "Default timeout for hung task detection (in seconds)"
depends on DETECT_HUNG_TASK
default 120
help
This option controls the default timeout (in seconds) used
to determine when a task has become non-responsive and should
be considered hung.
It can be adjusted at runtime via the kernel.hung_task_timeout_secs
sysctl or by writing a value to
/proc/sys/kernel/hung_task_timeout_secs.
SLUB: Support for performance statistics The statistics provided here allow the monitoring of allocator behavior but at the cost of some (minimal) loss of performance. Counters are placed in SLUB's per cpu data structure. The per cpu structure may be extended by the statistics to grow larger than one cacheline which will increase the cache footprint of SLUB. There is a compile option to enable/disable the inclusion of the runtime statistics and its off by default. The slabinfo tool is enhanced to support these statistics via two options: -D Switches the line of information displayed for a slab from size mode to activity mode. -A Sorts the slabs displayed by activity. This allows the display of the slabs most important to the performance of a certain load. -r Report option will report detailed statistics on Example (tbench load): slabinfo -AD ->Shows the most active slabs Name Objects Alloc Free %Fast skbuff_fclone_cache 33 111953835 111953835 99 99 :0000192 2666 5283688 5281047 99 99 :0001024 849 5247230 5246389 83 83 vm_area_struct 1349 119642 118355 91 22 :0004096 15 66753 66751 98 98 :0000064 2067 25297 23383 98 78 dentry 10259 28635 18464 91 45 :0000080 11004 18950 8089 98 98 :0000096 1703 12358 10784 99 98 :0000128 762 10582 9875 94 18 :0000512 184 9807 9647 95 81 :0002048 479 9669 9195 83 65 anon_vma 777 9461 9002 99 71 kmalloc-8 6492 9981 5624 99 97 :0000768 258 7174 6931 58 15 So the skbuff_fclone_cache is of highest importance for the tbench load. Pretty high load on the 192 sized slab. Look for the aliases slabinfo -a | grep 000192 :0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili Likely skbuff_head_cache. Looking into the statistics of the skbuff_fclone_cache is possible through slabinfo skbuff_fclone_cache ->-r option implied if cache name is mentioned .... Usual output ... Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 111953360 111946981 99 99 Slowpath 1044 7423 0 0 Page Alloc 272 264 0 0 Add partial 25 325 0 0 Remove partial 86 264 0 0 RemoteObj/SlabFrozen 350 4832 0 0 Total 111954404 111954404 Flushes 49 Refill 0 Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%) Looks good because the fastpath is overwhelmingly taken. skbuff_head_cache: Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 5297262 5259882 99 99 Slowpath 4477 39586 0 0 Page Alloc 937 824 0 0 Add partial 0 2515 0 0 Remove partial 1691 824 0 0 RemoteObj/SlabFrozen 2621 9684 0 0 Total 5301739 5299468 Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%) Descriptions of the output: Total: The total number of allocation and frees that occurred for a slab Fastpath: The number of allocations/frees that used the fastpath. Slowpath: Other allocations Page Alloc: Number of calls to the page allocator as a result of slowpath processing Add Partial: Number of slabs added to the partial list through free or alloc (occurs during cpuslab flushes) Remove Partial: Number of slabs removed from the partial list as a result of allocations retrieving a partial slab or by a free freeing the last object of a slab. RemoteObj/Froz: How many times were remotely freed object encountered when a slab was about to be deactivated. Frozen: How many times was free able to skip list processing because the slab was in use as the cpuslab of another processor. Flushes: Number of times the cpuslab was flushed on request (kmem_cache_shrink, may result from races in __slab_alloc) Refill: Number of times we were able to refill the cpuslab from remotely freed objects for the same slab. Deactivate: Statistics how slabs were deactivated. Shows how they were put onto the partial list. In general fastpath is very good. Slowpath without partial list processing is also desirable. Any touching of partial list uses node specific locks which may potentially cause list lock contention. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-08 09:47:41 +08:00
A timeout of 0 disables the check. The default is two minutes.
Keeping the default should be fine in most cases.
config BOOTPARAM_HUNG_TASK_PANIC
bool "Panic (Reboot) On Hung Tasks"
depends on DETECT_HUNG_TASK
help
Say Y here to enable the kernel to panic on "hung tasks",
which are bugs that cause the kernel to leave a task stuck
in uninterruptible "D" state.
The panic can be used in combination with panic_timeout,
to cause the system to reboot automatically after a
hung task has been detected. This feature is useful for
high-availability systems that have uptime guarantees and
where a hung tasks must be resolved ASAP.
Say N if unsure.
workqueue: implement lockup detector Workqueue stalls can happen from a variety of usage bugs such as missing WQ_MEM_RECLAIM flag or concurrency managed work item indefinitely staying RUNNING. These stalls can be extremely difficult to hunt down because the usual warning mechanisms can't detect workqueue stalls and the internal state is pretty opaque. To alleviate the situation, this patch implements workqueue lockup detector. It periodically monitors all worker_pools periodically and, if any pool failed to make forward progress longer than the threshold duration, triggers warning and dumps workqueue state as follows. BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256 pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent workqueue events_power_efficient: flags=0x80 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 pending: check_lifetime, neigh_periodic_work workqueue cgroup_pidlist_destroy: flags=0x0 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 pending: cgroup_pidlist_destroy_work_fn ... The detection mechanism is controller through kernel parameter workqueue.watchdog_thresh and can be updated at runtime through the sysfs module parameter file. v2: Decoupled from softlockup control knobs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-09 00:28:04 +08:00
config WQ_WATCHDOG
bool "Detect Workqueue Stalls"
depends on DEBUG_KERNEL
help
Say Y here to enable stall detection on workqueues. If a
worker pool doesn't make forward progress on a pending work
item for over a given amount of time, 30s by default, a
warning message is printed along with dump of workqueue
state. This can be configured through kernel parameter
"workqueue.watchdog_thresh" and its sysfs counterpart.
config WQ_CPU_INTENSIVE_REPORT
bool "Report per-cpu work items which hog CPU for too long"
depends on DEBUG_KERNEL
help
Say Y here to enable reporting of concurrency-managed per-cpu work
items that hog CPUs for longer than
workqueue.cpu_intensive_thresh_us. Workqueue automatically
detects and excludes them from concurrency management to prevent
them from stalling other per-cpu work items. Occassional
triggering may not necessarily indicate a problem. Repeated
triggering likely indicates that the work item should be switched
to use an unbound workqueue.
lib/test_lockup: test module to generate lockups CONFIG_TEST_LOCKUP=m adds module "test_lockup" that helps to make sure that watchdogs and lockup detectors are working properly. Depending on module parameters test_lockup could emulate soft or hard lockup, "hung task", hold arbitrary lock, allocate bunch of pages. Also it could generate series of lockups with cooling-down periods, in this way it could be used as "ping" for locks or page allocator. Loop checks signals between iteration thus could be stopped by ^C. # modinfo test_lockup ... parm: time_secs:lockup time in seconds, default 0 (uint) parm: time_nsecs:nanoseconds part of lockup time, default 0 (uint) parm: cooldown_secs:cooldown time between iterations in seconds, default 0 (uint) parm: cooldown_nsecs:nanoseconds part of cooldown, default 0 (uint) parm: iterations:lockup iterations, default 1 (uint) parm: all_cpus:trigger lockup at all cpus at once (bool) parm: state:wait in 'R' running (default), 'D' uninterruptible, 'K' killable, 'S' interruptible state (charp) parm: use_hrtimer:use high-resolution timer for sleeping (bool) parm: iowait:account sleep time as iowait (bool) parm: lock_read:lock read-write locks for read (bool) parm: lock_single:acquire locks only at one cpu (bool) parm: reacquire_locks:release and reacquire locks/irq/preempt between iterations (bool) parm: touch_softlockup:touch soft-lockup watchdog between iterations (bool) parm: touch_hardlockup:touch hard-lockup watchdog between iterations (bool) parm: call_cond_resched:call cond_resched() between iterations (bool) parm: measure_lock_wait:measure lock wait time (bool) parm: lock_wait_threshold:print lock wait time longer than this in nanoseconds, default off (ulong) parm: disable_irq:disable interrupts: generate hard-lockups (bool) parm: disable_softirq:disable bottom-half irq handlers (bool) parm: disable_preempt:disable preemption: generate soft-lockups (bool) parm: lock_rcu:grab rcu_read_lock: generate rcu stalls (bool) parm: lock_mmap_sem:lock mm->mmap_sem: block procfs interfaces (bool) parm: lock_rwsem_ptr:lock rw_semaphore at address (ulong) parm: lock_mutex_ptr:lock mutex at address (ulong) parm: lock_spinlock_ptr:lock spinlock at address (ulong) parm: lock_rwlock_ptr:lock rwlock at address (ulong) parm: alloc_pages_nr:allocate and free pages under locks (uint) parm: alloc_pages_order:page order to allocate (uint) parm: alloc_pages_gfp:allocate pages with this gfp_mask, default GFP_KERNEL (uint) parm: alloc_pages_atomic:allocate pages with GFP_ATOMIC (bool) parm: reallocate_pages:free and allocate pages between iterations (bool) Parameters for locking by address are unsafe and taints kernel. With CONFIG_DEBUG_SPINLOCK=y they at least check magics for embedded spinlocks. Examples: task hang in D-state: modprobe test_lockup time_secs=1 iterations=60 state=D task hang in io-wait D-state: modprobe test_lockup time_secs=1 iterations=60 state=D iowait softlockup: modprobe test_lockup time_secs=1 iterations=60 state=R hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R disable_irq system-wide hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq all_cpus rcu stall: modprobe test_lockup time_secs=1 iterations=60 state=R \ lock_rcu touch_softlockup lock mmap_sem / block procfs interfaces: modprobe test_lockup time_secs=1 iterations=60 state=S lock_mmap_sem lock tasklist_lock for read / block forks: TASKLIST_LOCK=$(awk '$3 == "tasklist_lock" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq lock_read lock_rwlock_ptr=$TASKLIST_LOCK lock namespace_sem / block vfs mount operations: NAMESPACE_SEM=$(awk '$3 == "namespace_sem" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_rwsem_ptr=$NAMESPACE_SEM lock cgroup mutex / block cgroup operations: CGROUP_MUTEX=$(awk '$3 == "cgroup_mutex" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX ping cgroup_mutex every second and measure maximum lock wait time: modprobe test_lockup cooldown_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX reacquire_locks measure_lock_wait [linux@roeck-us.net: rename disable_irq to fix build error] Link: http://lkml.kernel.org/r/20200317133614.23152-1-linux@roeck-us.net Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru Cc: Colin Ian King <colin.king@canonical.com> Cc: Guenter Roeck <linux@roeck-us.net> Link: http://lkml.kernel.org/r/158132859146.2797.525923171323227836.stgit@buzz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 11:09:47 +08:00
config TEST_LOCKUP
tristate "Test module to generate lockups"
depends on m
lib/test_lockup: test module to generate lockups CONFIG_TEST_LOCKUP=m adds module "test_lockup" that helps to make sure that watchdogs and lockup detectors are working properly. Depending on module parameters test_lockup could emulate soft or hard lockup, "hung task", hold arbitrary lock, allocate bunch of pages. Also it could generate series of lockups with cooling-down periods, in this way it could be used as "ping" for locks or page allocator. Loop checks signals between iteration thus could be stopped by ^C. # modinfo test_lockup ... parm: time_secs:lockup time in seconds, default 0 (uint) parm: time_nsecs:nanoseconds part of lockup time, default 0 (uint) parm: cooldown_secs:cooldown time between iterations in seconds, default 0 (uint) parm: cooldown_nsecs:nanoseconds part of cooldown, default 0 (uint) parm: iterations:lockup iterations, default 1 (uint) parm: all_cpus:trigger lockup at all cpus at once (bool) parm: state:wait in 'R' running (default), 'D' uninterruptible, 'K' killable, 'S' interruptible state (charp) parm: use_hrtimer:use high-resolution timer for sleeping (bool) parm: iowait:account sleep time as iowait (bool) parm: lock_read:lock read-write locks for read (bool) parm: lock_single:acquire locks only at one cpu (bool) parm: reacquire_locks:release and reacquire locks/irq/preempt between iterations (bool) parm: touch_softlockup:touch soft-lockup watchdog between iterations (bool) parm: touch_hardlockup:touch hard-lockup watchdog between iterations (bool) parm: call_cond_resched:call cond_resched() between iterations (bool) parm: measure_lock_wait:measure lock wait time (bool) parm: lock_wait_threshold:print lock wait time longer than this in nanoseconds, default off (ulong) parm: disable_irq:disable interrupts: generate hard-lockups (bool) parm: disable_softirq:disable bottom-half irq handlers (bool) parm: disable_preempt:disable preemption: generate soft-lockups (bool) parm: lock_rcu:grab rcu_read_lock: generate rcu stalls (bool) parm: lock_mmap_sem:lock mm->mmap_sem: block procfs interfaces (bool) parm: lock_rwsem_ptr:lock rw_semaphore at address (ulong) parm: lock_mutex_ptr:lock mutex at address (ulong) parm: lock_spinlock_ptr:lock spinlock at address (ulong) parm: lock_rwlock_ptr:lock rwlock at address (ulong) parm: alloc_pages_nr:allocate and free pages under locks (uint) parm: alloc_pages_order:page order to allocate (uint) parm: alloc_pages_gfp:allocate pages with this gfp_mask, default GFP_KERNEL (uint) parm: alloc_pages_atomic:allocate pages with GFP_ATOMIC (bool) parm: reallocate_pages:free and allocate pages between iterations (bool) Parameters for locking by address are unsafe and taints kernel. With CONFIG_DEBUG_SPINLOCK=y they at least check magics for embedded spinlocks. Examples: task hang in D-state: modprobe test_lockup time_secs=1 iterations=60 state=D task hang in io-wait D-state: modprobe test_lockup time_secs=1 iterations=60 state=D iowait softlockup: modprobe test_lockup time_secs=1 iterations=60 state=R hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R disable_irq system-wide hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq all_cpus rcu stall: modprobe test_lockup time_secs=1 iterations=60 state=R \ lock_rcu touch_softlockup lock mmap_sem / block procfs interfaces: modprobe test_lockup time_secs=1 iterations=60 state=S lock_mmap_sem lock tasklist_lock for read / block forks: TASKLIST_LOCK=$(awk '$3 == "tasklist_lock" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq lock_read lock_rwlock_ptr=$TASKLIST_LOCK lock namespace_sem / block vfs mount operations: NAMESPACE_SEM=$(awk '$3 == "namespace_sem" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_rwsem_ptr=$NAMESPACE_SEM lock cgroup mutex / block cgroup operations: CGROUP_MUTEX=$(awk '$3 == "cgroup_mutex" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX ping cgroup_mutex every second and measure maximum lock wait time: modprobe test_lockup cooldown_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX reacquire_locks measure_lock_wait [linux@roeck-us.net: rename disable_irq to fix build error] Link: http://lkml.kernel.org/r/20200317133614.23152-1-linux@roeck-us.net Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru Cc: Colin Ian King <colin.king@canonical.com> Cc: Guenter Roeck <linux@roeck-us.net> Link: http://lkml.kernel.org/r/158132859146.2797.525923171323227836.stgit@buzz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 11:09:47 +08:00
help
This builds the "test_lockup" module that helps to make sure
that watchdogs and lockup detectors are working properly.
Depending on module parameters it could emulate soft or hard
lockup, "hung task", or locking arbitrary lock for a long time.
Also it could generate series of lockups with cooling-down periods.
If unsure, say N.
endmenu # "Debug lockups and hangs"
menu "Scheduler Debugging"
config SCHED_DEBUG
bool "Collect scheduler debugging info"
depends on DEBUG_KERNEL && DEBUG_FS
default y
help
If you say Y here, the /sys/kernel/debug/sched file will be provided
that can help debug the scheduler. The runtime overhead of this
option is minimal.
config SCHED_INFO
bool
default n
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on PROC_FS
select SCHED_INFO
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
endmenu
sched: Add default-disabled option to BUG() when stack end location is overwritten Currently in the event of a stack overrun a call to schedule() does not check for this type of corruption. This corruption is often silent and can go unnoticed. However once the corrupted region is examined at a later stage, the outcome is undefined and often results in a sporadic page fault which cannot be handled. This patch checks for a stack overrun and takes appropriate action since the damage is already done, there is no point in continuing. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: oleg@redhat.com Cc: riel@redhat.com Cc: prarit@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Cc: rostedt@goodmis.org Cc: hannes@cmpxchg.org Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David S. Miller <davem@davemloft.net> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lubomir Rintel <lkundrak@v3.sk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1410527779-8133-4-git-send-email-atomlin@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-12 21:16:19 +08:00
config DEBUG_TIMEKEEPING
bool "Enable extra timekeeping sanity checking"
help
This option will enable additional timekeeping sanity checks
which may be helpful when diagnosing issues where timekeeping
problems are suspected.
This may include checks in the timekeeping hotpaths, so this
option may have a (very small) performance impact to some
workloads.
If unsure, say N.
config DEBUG_PREEMPT
bool "Debug preemptible kernel"
depends on DEBUG_KERNEL && PREEMPTION && TRACE_IRQFLAGS_SUPPORT
help
If you say Y here then the kernel will use a debug variant of the
commonly used smp_processor_id() function and will print warnings
if kernel code uses it in a preemption-unsafe way. Also, the kernel
will detect preemption count underflows.
lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default In workloads where this_cpu operations are frequently performed, enabling DEBUG_PREEMPT may result in significant increase in runtime overhead due to frequent invocation of __this_cpu_preempt_check() function. This can be demonstrated through benchmarks such as hackbench where this configuration results in a 10% reduction in performance, primarily due to the added overhead within memcg charging path. Therefore, do not to enable DEBUG_PREEMPT by default and make users aware of its potential impact on performance in some workloads. hackbench-process-sockets debug_preempt no_debug_preempt Amean 1 0.4743 ( 0.00%) 0.4295 * 9.45%* Amean 4 1.4191 ( 0.00%) 1.2650 * 10.86%* Amean 7 2.2677 ( 0.00%) 2.0094 * 11.39%* Amean 12 3.6821 ( 0.00%) 3.2115 * 12.78%* Amean 21 6.6752 ( 0.00%) 5.7956 * 13.18%* Amean 30 9.6646 ( 0.00%) 8.5197 * 11.85%* Amean 48 15.3363 ( 0.00%) 13.5559 * 11.61%* Amean 79 24.8603 ( 0.00%) 22.0597 * 11.27%* Amean 96 30.1240 ( 0.00%) 26.8073 * 11.01%* Link: https://lkml.kernel.org/r/20230121033942.350387-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Ben Segall <bsegall@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-21 11:39:42 +08:00
This option has potential to introduce high runtime overhead,
depending on workload as it triggers debugging routines for each
this_cpu operation. It should only be used for debugging purposes.
menu "Lock Debugging (spinlocks, mutexes, etc...)"
config LOCK_DEBUGGING_SUPPORT
bool
depends on TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
default y
config PROVE_LOCKING
bool "Lock debugging: prove locking correctness"
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
select LOCKDEP
select DEBUG_SPINLOCK
select DEBUG_MUTEXES if !PREEMPT_RT
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_RWSEMS if !PREEMPT_RT
select DEBUG_WW_MUTEX_SLOWPATH
select DEBUG_LOCK_ALLOC
select PREEMPT_COUNT if !ARCH_NO_PREEMPT
select TRACE_IRQFLAGS
default n
help
This feature enables the kernel to prove that all locking
that occurs in the kernel runtime is mathematically
correct: that under no circumstance could an arbitrary (and
not yet triggered) combination of observed locking
sequences (on an arbitrary number of CPUs, running an
arbitrary number of tasks and interrupt contexts) cause a
deadlock.
In short, this feature enables the kernel to report locking
related deadlocks before they actually occur.
The proof does not depend on how hard and complex a
deadlock scenario would be to trigger: how many
participant CPUs, tasks and irq-contexts would be needed
for it to trigger. The proof also does not depend on
timing: if a race and a resulting deadlock is possible
theoretically (no matter how unlikely the race scenario
is), it will be proven so and will immediately be
reported by the kernel (once the event is observed that
makes the deadlock theoretically possible).
If a deadlock is impossible (i.e. the locking rules, as
observed by the kernel, are mathematically correct), the
kernel reports nothing.
NOTE: this feature can also be enabled for rwlocks, mutexes
and rwsems - in which case all dependencies between these
different locking variants are observed and mapped too, and
the proof of observed correctness is also maintained for an
arbitrary combination of these separate locking variants.
For more details, see Documentation/locking/lockdep-design.rst.
lockdep: Introduce wait-type checks Extend lockdep to validate lock wait-type context. The current wait-types are: LD_WAIT_FREE, /* wait free, rcu etc.. */ LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */ Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep(). Obviously RCU made the entire ordeal more complex than a simple single value test because RCU can be acquired in (pretty much) any context and while it presents a context to nested locks it is not the same as it got acquired in. Therefore its necessary to split the wait_type into two values, one representing the acquire (outer) and one representing the nested context (inner). For most 'normal' locks these two are the same. [ To make static initialization easier we have the rule that: .outer == INV means .outer == .inner; because INV == 0. ] It further means that its required to find the minimal .inner of the held stack to compare against the outer of the new lock; because while 'normal' RCU presents a CONFIG type to nested locks, if it is taken while already holding a SPIN type it obviously doesn't relax the rules. Below is an example output generated by the trivial test code: raw_spin_lock(&foo); spin_lock(&bar); spin_unlock(&bar); raw_spin_unlock(&foo); [ BUG: Invalid wait context ] ----------------------------- swapper/0/1 is trying to lock: ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187 other info that might help us debug this: 1 lock held by swapper/0/1: #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187 The way to read it is to look at the new -{n,m} part in the lock description; -{3:3} for the attempted lock, and try and match that up to the held locks, which in this case is the one: -{2,2}. This tells that the acquiring lock requires a more relaxed environment than presented by the lock stack. Currently only the normal locks and RCU are converted, the rest of the lockdep users defaults to .inner = INV which is ignored. More conversions can be done when desired. The check for spinlock_t nesting is not enabled by default. It's a separate config option for now as there are known problems which are currently addressed. The config option allows to identify these problems and to verify that the solutions found are indeed solving them. The config switch will be removed and the checks will permanently enabled once the vast majority of issues has been addressed. [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP] [ tglx: Add the config option ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21 19:26:01 +08:00
config PROVE_RAW_LOCK_NESTING
bool "Enable raw_spinlock - spinlock nesting checks"
depends on PROVE_LOCKING
default n
help
Enable the raw_spinlock vs. spinlock nesting checks which ensure
that the lock nesting rules for PREEMPT_RT enabled kernels are
not violated.
NOTE: There are known nesting problems. So if you enable this
option expect lockdep splats until these problems have been fully
addressed which is work in progress. This config switch allows to
identify and analyze these problems. It will be removed and the
check permanently enabled once the main issues have been fixed.
lockdep: Introduce wait-type checks Extend lockdep to validate lock wait-type context. The current wait-types are: LD_WAIT_FREE, /* wait free, rcu etc.. */ LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */ Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep(). Obviously RCU made the entire ordeal more complex than a simple single value test because RCU can be acquired in (pretty much) any context and while it presents a context to nested locks it is not the same as it got acquired in. Therefore its necessary to split the wait_type into two values, one representing the acquire (outer) and one representing the nested context (inner). For most 'normal' locks these two are the same. [ To make static initialization easier we have the rule that: .outer == INV means .outer == .inner; because INV == 0. ] It further means that its required to find the minimal .inner of the held stack to compare against the outer of the new lock; because while 'normal' RCU presents a CONFIG type to nested locks, if it is taken while already holding a SPIN type it obviously doesn't relax the rules. Below is an example output generated by the trivial test code: raw_spin_lock(&foo); spin_lock(&bar); spin_unlock(&bar); raw_spin_unlock(&foo); [ BUG: Invalid wait context ] ----------------------------- swapper/0/1 is trying to lock: ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187 other info that might help us debug this: 1 lock held by swapper/0/1: #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187 The way to read it is to look at the new -{n,m} part in the lock description; -{3:3} for the attempted lock, and try and match that up to the held locks, which in this case is the one: -{2,2}. This tells that the acquiring lock requires a more relaxed environment than presented by the lock stack. Currently only the normal locks and RCU are converted, the rest of the lockdep users defaults to .inner = INV which is ignored. More conversions can be done when desired. The check for spinlock_t nesting is not enabled by default. It's a separate config option for now as there are known problems which are currently addressed. The config option allows to identify these problems and to verify that the solutions found are indeed solving them. The config switch will be removed and the checks will permanently enabled once the vast majority of issues has been addressed. [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP] [ tglx: Add the config option ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21 19:26:01 +08:00
If unsure, select N.
config LOCK_STAT
bool "Lock usage statistics"
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
select LOCKDEP
select DEBUG_SPINLOCK
select DEBUG_MUTEXES if !PREEMPT_RT
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_LOCK_ALLOC
default n
help
This feature enables tracking lock contention points
For more details, see Documentation/locking/lockstat.rst
This also enables lock events required by "perf lock",
subcommand of perf.
If you want to use "perf lock", you also need to turn on
CONFIG_EVENT_TRACING.
CONFIG_LOCK_STAT defines "contended" and "acquired" lock events.
(CONFIG_LOCKDEP defines "acquire" and "release" events.)
config DEBUG_RT_MUTEXES
bool "RT Mutex debugging, deadlock detection"
depends on DEBUG_KERNEL && RT_MUTEXES
help
This allows rt mutex semantics violations and rt mutex related
deadlocks (lockups) to be detected and reported automatically.
config DEBUG_SPINLOCK
bool "Spinlock and rw-lock debugging: basic checks"
depends on DEBUG_KERNEL
select UNINLINE_SPIN_UNLOCK
help
Say Y here and build SMP to catch missing spinlock initialization
and certain other kinds of spinlock errors commonly made. This is
best used in conjunction with the NMI watchdog so that spinlock
deadlocks are also debuggable.
config DEBUG_MUTEXES
bool "Mutex debugging: basic checks"
depends on DEBUG_KERNEL && !PREEMPT_RT
help
This feature allows mutex semantics violations to be detected and
reported.
mutex: Add w/w mutex slowpath debugging Injects EDEADLK conditions at pseudo-random interval, with exponential backoff up to UINT_MAX (to ensure that every lock operation still completes in a reasonable time). This way we can test the wound slowpath even for ww mutex users where contention is never expected, and the ww deadlock avoidance algorithm is only needed for correctness against malicious userspace. An example would be protecting kernel modesetting properties, which thanks to single-threaded X isn't really expected to contend, ever. I've looked into using the CONFIG_FAULT_INJECTION infrastructure, but decided against it for two reasons: - EDEADLK handling is mandatory for ww mutex users and should never affect the outcome of a syscall. This is in contrast to -ENOMEM injection. So fine configurability isn't required. - The fault injection framework only allows to set a simple probability for failure. Now the probability that a ww mutex acquire stage with N locks will never complete (due to too many injected EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock operations for the completely uncontended case would be O(exp(N)). The per-acuiqire ctx exponential backoff solution choosen here only results in O(log N) overhead due to injection and so O(log N * N) lock operations. This way we can fail with high probability (and so have good test coverage even for fancy backoff and lock acquisition paths) without running into patalogical cases. Note that EDEADLK will only ever be injected when we managed to acquire the lock. This prevents any behaviour changes for users which rely on the EALREADY semantics. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: rostedt@goodmis.org Cc: daniel@ffwll.ch Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 19:31:17 +08:00
config DEBUG_WW_MUTEX_SLOWPATH
bool "Wait/wound mutex debugging: Slowpath testing"
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
mutex: Add w/w mutex slowpath debugging Injects EDEADLK conditions at pseudo-random interval, with exponential backoff up to UINT_MAX (to ensure that every lock operation still completes in a reasonable time). This way we can test the wound slowpath even for ww mutex users where contention is never expected, and the ww deadlock avoidance algorithm is only needed for correctness against malicious userspace. An example would be protecting kernel modesetting properties, which thanks to single-threaded X isn't really expected to contend, ever. I've looked into using the CONFIG_FAULT_INJECTION infrastructure, but decided against it for two reasons: - EDEADLK handling is mandatory for ww mutex users and should never affect the outcome of a syscall. This is in contrast to -ENOMEM injection. So fine configurability isn't required. - The fault injection framework only allows to set a simple probability for failure. Now the probability that a ww mutex acquire stage with N locks will never complete (due to too many injected EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock operations for the completely uncontended case would be O(exp(N)). The per-acuiqire ctx exponential backoff solution choosen here only results in O(log N) overhead due to injection and so O(log N * N) lock operations. This way we can fail with high probability (and so have good test coverage even for fancy backoff and lock acquisition paths) without running into patalogical cases. Note that EDEADLK will only ever be injected when we managed to acquire the lock. This prevents any behaviour changes for users which rely on the EALREADY semantics. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: rostedt@goodmis.org Cc: daniel@ffwll.ch Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 19:31:17 +08:00
select DEBUG_LOCK_ALLOC
select DEBUG_SPINLOCK
select DEBUG_MUTEXES if !PREEMPT_RT
select DEBUG_RT_MUTEXES if PREEMPT_RT
mutex: Add w/w mutex slowpath debugging Injects EDEADLK conditions at pseudo-random interval, with exponential backoff up to UINT_MAX (to ensure that every lock operation still completes in a reasonable time). This way we can test the wound slowpath even for ww mutex users where contention is never expected, and the ww deadlock avoidance algorithm is only needed for correctness against malicious userspace. An example would be protecting kernel modesetting properties, which thanks to single-threaded X isn't really expected to contend, ever. I've looked into using the CONFIG_FAULT_INJECTION infrastructure, but decided against it for two reasons: - EDEADLK handling is mandatory for ww mutex users and should never affect the outcome of a syscall. This is in contrast to -ENOMEM injection. So fine configurability isn't required. - The fault injection framework only allows to set a simple probability for failure. Now the probability that a ww mutex acquire stage with N locks will never complete (due to too many injected EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock operations for the completely uncontended case would be O(exp(N)). The per-acuiqire ctx exponential backoff solution choosen here only results in O(log N) overhead due to injection and so O(log N * N) lock operations. This way we can fail with high probability (and so have good test coverage even for fancy backoff and lock acquisition paths) without running into patalogical cases. Note that EDEADLK will only ever be injected when we managed to acquire the lock. This prevents any behaviour changes for users which rely on the EALREADY semantics. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: rostedt@goodmis.org Cc: daniel@ffwll.ch Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 19:31:17 +08:00
help
This feature enables slowpath testing for w/w mutex users by
injecting additional -EDEADLK wound/backoff cases. Together with
the full mutex checks enabled with (CONFIG_PROVE_LOCKING) this
will test all possible w/w mutex interface abuse with the
exception of simply not acquiring all the required locks.
Note that this feature can introduce significant overhead, so
it really should not be enabled in a production or distro kernel,
even a debug kernel. If you are a driver writer, enable it. If
you are a distro, do not.
mutex: Add w/w mutex slowpath debugging Injects EDEADLK conditions at pseudo-random interval, with exponential backoff up to UINT_MAX (to ensure that every lock operation still completes in a reasonable time). This way we can test the wound slowpath even for ww mutex users where contention is never expected, and the ww deadlock avoidance algorithm is only needed for correctness against malicious userspace. An example would be protecting kernel modesetting properties, which thanks to single-threaded X isn't really expected to contend, ever. I've looked into using the CONFIG_FAULT_INJECTION infrastructure, but decided against it for two reasons: - EDEADLK handling is mandatory for ww mutex users and should never affect the outcome of a syscall. This is in contrast to -ENOMEM injection. So fine configurability isn't required. - The fault injection framework only allows to set a simple probability for failure. Now the probability that a ww mutex acquire stage with N locks will never complete (due to too many injected EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock operations for the completely uncontended case would be O(exp(N)). The per-acuiqire ctx exponential backoff solution choosen here only results in O(log N) overhead due to injection and so O(log N * N) lock operations. This way we can fail with high probability (and so have good test coverage even for fancy backoff and lock acquisition paths) without running into patalogical cases. Note that EDEADLK will only ever be injected when we managed to acquire the lock. This prevents any behaviour changes for users which rely on the EALREADY semantics. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: rostedt@goodmis.org Cc: daniel@ffwll.ch Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 19:31:17 +08:00
config DEBUG_RWSEMS
bool "RW Semaphore debugging: basic checks"
depends on DEBUG_KERNEL && !PREEMPT_RT
help
This debugging feature allows mismatched rw semaphore locks
and unlocks to be detected and reported.
config DEBUG_LOCK_ALLOC
bool "Lock debugging: detect incorrect freeing of live locks"
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
select DEBUG_SPINLOCK
select DEBUG_MUTEXES if !PREEMPT_RT
select DEBUG_RT_MUTEXES if RT_MUTEXES
select LOCKDEP
help
This feature will check whether any held lock (spinlock, rwlock,
mutex or rwsem) is incorrectly freed by the kernel, via any of the
memory-freeing routines (kfree(), kmem_cache_free(), free_pages(),
vfree(), etc.), whether a live lock is incorrectly reinitialized via
spin_lock_init()/mutex_init()/etc., or whether there is any lock
held during task exit.
config LOCKDEP
bool
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
select STACKTRACE
select KALLSYMS
select KALLSYMS_ALL
config LOCKDEP_SMALL
bool
config LOCKDEP_BITS
int "Bitsize for MAX_LOCKDEP_ENTRIES"
depends on LOCKDEP && !LOCKDEP_SMALL
range 10 30
default 15
help
Try increasing this value if you hit "BUG: MAX_LOCKDEP_ENTRIES too low!" message.
config LOCKDEP_CHAINS_BITS
int "Bitsize for MAX_LOCKDEP_CHAINS"
depends on LOCKDEP && !LOCKDEP_SMALL
range 10 30
default 16
help
Try increasing this value if you hit "BUG: MAX_LOCKDEP_CHAINS too low!" message.
config LOCKDEP_STACK_TRACE_BITS
int "Bitsize for MAX_STACK_TRACE_ENTRIES"
depends on LOCKDEP && !LOCKDEP_SMALL
range 10 30
default 19
help
Try increasing this value if you hit "BUG: MAX_STACK_TRACE_ENTRIES too low!" message.
config LOCKDEP_STACK_TRACE_HASH_BITS
int "Bitsize for STACK_TRACE_HASH_SIZE"
depends on LOCKDEP && !LOCKDEP_SMALL
range 10 30
default 14
help
Try increasing this value if you need large STACK_TRACE_HASH_SIZE.
config LOCKDEP_CIRCULAR_QUEUE_BITS
int "Bitsize for elements in circular_queue struct"
depends on LOCKDEP
range 10 30
default 12
help
Try increasing this value if you hit "lockdep bfs error:-1" warning due to __cq_enqueue() failure.
config DEBUG_LOCKDEP
bool "Lock dependency engine debugging"
depends on DEBUG_KERNEL && LOCKDEP
lockdep: report broken irq restoration We generally expect local_irq_save() and local_irq_restore() to be paired and sanely nested, and so local_irq_restore() expects to be called with irqs disabled. Thus, within local_irq_restore() we only trace irq flag changes when unmasking irqs. This means that a sequence such as: | local_irq_disable(); | local_irq_save(flags); | local_irq_enable(); | local_irq_restore(flags); ... is liable to break things, as the local_irq_restore() would mask irqs without tracing this change. Similar problems may exist for architectures whose arch_irq_restore() function depends on being called with irqs disabled. We don't consider such sequences to be a good idea, so let's define those as forbidden, and add tooling to detect such broken cases. This patch adds debug code to WARN() when raw_local_irq_restore() is called with irqs enabled. As raw_local_irq_restore() is expected to pair with raw_local_irq_save(), it should never be called with irqs enabled. To avoid the possibility of circular header dependencies between irqflags.h and bug.h, the warning is handled in a separate C file. The new code is all conditional on a new CONFIG_DEBUG_IRQFLAGS symbol which is independent of CONFIG_TRACE_IRQFLAGS. As noted above such cases will confuse lockdep, so CONFIG_DEBUG_LOCKDEP now selects CONFIG_DEBUG_IRQFLAGS. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210111153707.10071-1-mark.rutland@arm.com
2021-01-11 23:37:07 +08:00
select DEBUG_IRQFLAGS
help
If you say Y here, the lock dependency engine will do
additional runtime checks to debug itself, at the price
of more runtime overhead.
config DEBUG_ATOMIC_SLEEP
bool "Sleep inside atomic section checking"
select PREEMPT_COUNT
depends on DEBUG_KERNEL
depends on !ARCH_NO_PREEMPT
help
If you say Y here, various routines which may sleep will become very
noisy if they are called inside atomic sections: when a spinlock is
held, inside an rcu read side critical section, inside preempt disabled
sections, inside an interrupt, etc...
[PATCH] lockdep: locking API self tests Introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock debugging code's silent-failure feature to run a matrix of testcases. There are 210 testcases currently: +----------------------- | Locking API testsuite: +------------------------------+------+------+------+------+------+------+ | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------+------+------+------+------+------+------+ A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | --------------------------------------+------+------+------+------+------+ recursive read-lock: | ok | | ok | --------------------------------------+------+------+------+------+------+ non-nested unlock: ok | ok | ok | ok | --------------------------------------+------+------+------+ hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | --------------------------------+-----+---------------- Good, all 210 testcases passed! | --------------------------------+ Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:48 +08:00
config DEBUG_LOCKING_API_SELFTESTS
bool "Locking API boot-time self-tests"
depends on DEBUG_KERNEL
help
Say Y here if you want the kernel to run a short self-test during
bootup. The self-test checks whether common types of locking bugs
are detected by debugging mechanisms or not. (if you disable
lock debugging then those bugs won't be detected of course.)
[PATCH] lockdep: locking API self tests Introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock debugging code's silent-failure feature to run a matrix of testcases. There are 210 testcases currently: +----------------------- | Locking API testsuite: +------------------------------+------+------+------+------+------+------+ | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------+------+------+------+------+------+------+ A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | --------------------------------------+------+------+------+------+------+ recursive read-lock: | ok | | ok | --------------------------------------+------+------+------+------+------+ non-nested unlock: ok | ok | ok | ok | --------------------------------------+------+------+------+ hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | --------------------------------+-----+---------------- Good, all 210 testcases passed! | --------------------------------+ Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:48 +08:00
The following locking APIs are covered: spinlocks, rwlocks,
mutexes and rwsems.
config LOCK_TORTURE_TEST
tristate "torture tests for locking"
depends on DEBUG_KERNEL
select TORTURE_TEST
help
This option provides a kernel module that runs torture tests
on kernel locking primitives. The kernel module may be built
after the fact on the running kernel to be tested, if desired.
Say Y here if you want kernel locking-primitive torture tests
to be built into the kernel.
Say M if you want these torture tests to build as a module.
Say N if you are unsure.
config WW_MUTEX_SELFTEST
tristate "Wait/wound mutex selftests"
help
This option provides a kernel module that runs tests on the
on the struct ww_mutex locking API.
It is recommended to enable DEBUG_WW_MUTEX_SLOWPATH in conjunction
with this test harness.
Say M if you want these self tests to build as a module.
Say N if you are unsure.
config SCF_TORTURE_TEST
tristate "torture tests for smp_call_function*()"
depends on DEBUG_KERNEL
select TORTURE_TEST
help
This option provides a kernel module that runs torture tests
on the smp_call_function() family of primitives. The kernel
module may be built after the fact on the running kernel to
be tested, if desired.
config CSD_LOCK_WAIT_DEBUG
bool "Debugging for csd_lock_wait(), called from smp_call_function*()"
depends on DEBUG_KERNEL
depends on 64BIT
default n
help
This option enables debug prints when CPUs are slow to respond
to the smp_call_function*() IPI wrappers. These debug prints
include the IPI handler function currently executing (if any)
and relevant stack traces.
config CSD_LOCK_WAIT_DEBUG_DEFAULT
bool "Default csd_lock_wait() debugging on at boot time"
depends on CSD_LOCK_WAIT_DEBUG
depends on 64BIT
default n
help
This option causes the csdlock_debug= kernel boot parameter to
default to 1 (basic debugging) instead of 0 (no debugging).
endmenu # lock debugging
config TRACE_IRQFLAGS
depends on TRACE_IRQFLAGS_SUPPORT
bool
help
Enables hooks to interrupt enabling and disabling for
either tracing or lock debugging.
config TRACE_IRQFLAGS_NMI
def_bool y
depends on TRACE_IRQFLAGS
depends on TRACE_IRQFLAGS_NMI_SUPPORT
config NMI_CHECK_CPU
bool "Debugging for CPUs failing to respond to backtrace requests"
depends on DEBUG_KERNEL
depends on X86
default n
help
Enables debug prints when a CPU fails to respond to a given
backtrace NMI. These prints provide some reasons why a CPU
might legitimately be failing to respond, for example, if it
is offline of if ignore_nmis is set.
lockdep: report broken irq restoration We generally expect local_irq_save() and local_irq_restore() to be paired and sanely nested, and so local_irq_restore() expects to be called with irqs disabled. Thus, within local_irq_restore() we only trace irq flag changes when unmasking irqs. This means that a sequence such as: | local_irq_disable(); | local_irq_save(flags); | local_irq_enable(); | local_irq_restore(flags); ... is liable to break things, as the local_irq_restore() would mask irqs without tracing this change. Similar problems may exist for architectures whose arch_irq_restore() function depends on being called with irqs disabled. We don't consider such sequences to be a good idea, so let's define those as forbidden, and add tooling to detect such broken cases. This patch adds debug code to WARN() when raw_local_irq_restore() is called with irqs enabled. As raw_local_irq_restore() is expected to pair with raw_local_irq_save(), it should never be called with irqs enabled. To avoid the possibility of circular header dependencies between irqflags.h and bug.h, the warning is handled in a separate C file. The new code is all conditional on a new CONFIG_DEBUG_IRQFLAGS symbol which is independent of CONFIG_TRACE_IRQFLAGS. As noted above such cases will confuse lockdep, so CONFIG_DEBUG_LOCKDEP now selects CONFIG_DEBUG_IRQFLAGS. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210111153707.10071-1-mark.rutland@arm.com
2021-01-11 23:37:07 +08:00
config DEBUG_IRQFLAGS
bool "Debug IRQ flag manipulation"
help
Enables checks for potentially unsafe enabling or disabling of
interrupts, such as calling raw_local_irq_restore() when interrupts
are enabled.
config STACKTRACE
bool "Stack backtrace support"
depends on STACKTRACE_SUPPORT
help
This option causes the kernel to create a /proc/pid/stack for
every process, showing its current stack trace.
It is also used by various kernel debugging features that require
stack trace generation.
config WARN_ALL_UNSEEDED_RANDOM
bool "Warn for all uses of unseeded randomness"
default n
help
Some parts of the kernel contain bugs relating to their use of
cryptographically secure random numbers before it's actually possible
to generate those numbers securely. This setting ensures that these
flaws don't go unnoticed, by enabling a message, should this ever
occur. This will allow people with obscure setups to know when things
are going wrong, so that they might contact developers about fixing
it.
Unfortunately, on some models of some architectures getting
a fully seeded CRNG is extremely difficult, and so this can
result in dmesg getting spammed for a surprisingly long
time. This is really bad from a security perspective, and
so architecture maintainers really need to do what they can
to get the CRNG seeded sooner after the system is booted.
However, since users cannot do anything actionable to
random: remove ratelimiting for in-kernel unseeded randomness The CONFIG_WARN_ALL_UNSEEDED_RANDOM debug option controls whether the kernel warns about all unseeded randomness or just the first instance. There's some complicated rate limiting and comparison to the previous caller, such that even with CONFIG_WARN_ALL_UNSEEDED_RANDOM enabled, developers still don't see all the messages or even an accurate count of how many were missed. This is the result of basically parallel mechanisms aimed at accomplishing more or less the same thing, added at different points in random.c history, which sort of compete with the first-instance-only limiting we have now. It turns out, however, that nobody cares about the first unseeded randomness instance of in-kernel users. The same first user has been there for ages now, and nobody is doing anything about it. It isn't even clear that anybody _can_ do anything about it. Most places that can do something about it have switched over to using get_random_bytes_wait() or wait_for_random_bytes(), which is the right thing to do, but there is still much code that needs randomness sometimes during init, and as a geeneral rule, if you're not using one of the _wait functions or the readiness notifier callback, you're bound to be doing it wrong just based on that fact alone. So warning about this same first user that can't easily change is simply not an effective mechanism for anything at all. Users can't do anything about it, as the Kconfig text points out -- the problem isn't in userspace code -- and kernel developers don't or more often can't react to it. Instead, show the warning for all instances when CONFIG_WARN_ALL_UNSEEDED_RANDOM is set, so that developers can debug things need be, or if it isn't set, don't show a warning at all. At the same time, CONFIG_WARN_ALL_UNSEEDED_RANDOM now implies setting random.ratelimit_disable=1 on by default, since if you care about one you probably care about the other too. And we can clean up usage around the related urandom_warning ratelimiter as well (whose behavior isn't changing), so that it properly counts missed messages after the 10 message threshold is reached. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-09 22:13:18 +08:00
address this, by default this option is disabled.
Say Y here if you want to receive warnings for all uses of
unseeded randomness. This will be of use primarily for
those developers interested in improving the security of
Linux kernels running on their architecture (or
subarchitecture).
config DEBUG_KOBJECT
bool "kobject debugging"
depends on DEBUG_KERNEL
help
If you say Y here, some extra kobject debugging messages will be sent
mm: remove CONFIG_HAVE_MEMBLOCK All architecures use memblock for early memory management. There is no need for the CONFIG_HAVE_MEMBLOCK configuration option. [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs] Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal] Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects] Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:07:44 +08:00
to the syslog.
config DEBUG_KOBJECT_RELEASE
bool "kobject release debugging"
depends on DEBUG_OBJECTS_TIMERS
help
kobjects are reference counted objects. This means that their
last reference count put is not predictable, and the kobject can
live on past the point at which a driver decides to drop its
initial reference to the kobject gained on allocation. An
example of this would be a struct device which has just been
unregistered.
However, some buggy drivers assume that after such an operation,
the memory backing the kobject can be immediately freed. This
goes completely against the principles of a refcounted object.
If you say Y here, the kernel will delay the release of kobjects
on the last reference count to improve the visibility of this
kind of kobject release bug.
config HAVE_DEBUG_BUGVERBOSE
bool
menu "Debug kernel data structures"
config DEBUG_LIST
bool "Debug linked list manipulation"
depends on DEBUG_KERNEL
list: Introduce CONFIG_LIST_HARDENED Numerous production kernel configs (see [1, 2]) are choosing to enable CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened configs [3]. The motivation behind this is that the option can be used as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025 are mitigated by the option [4]). The feature has never been designed with performance in mind, yet common list manipulation is happening across hot paths all over the kernel. Introduce CONFIG_LIST_HARDENED, which performs list pointer checking inline, and only upon list corruption calls the reporting slow path. To generate optimal machine code with CONFIG_LIST_HARDENED: 1. Elide checking for pointer values which upon dereference would result in an immediate access fault (i.e. minimal hardening checks). The trade-off is lower-quality error reports. 2. Use the __preserve_most function attribute (available with Clang, but not yet with GCC) to minimize the code footprint for calling the reporting slow path. As a result, function size of callers is reduced by avoiding saving registers before calling the rarely called reporting slow path. Note that all TUs in lib/Makefile already disable function tracing, including list_debug.c, and __preserve_most's implied notrace has no effect in this case. 3. Because the inline checks are a subset of the full set of checks in __list_*_valid_or_report(), always return false if the inline checks failed. This avoids redundant compare and conditional branch right after return from the slow path. As a side-effect of the checks being inline, if the compiler can prove some condition to always be true, it can completely elide some checks. Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the Kconfig variables are changed to reflect that: DEBUG_LIST selects LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on DEBUG_LIST. Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with "preserve_most") shows throughput improvements, in my case of ~7% on average (up to 20-30% on some test cases). Link: https://r.android.com/1266735 [1] Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2] Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3] Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4] Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-11 23:18:40 +08:00
select LIST_HARDENED
help
list: Introduce CONFIG_LIST_HARDENED Numerous production kernel configs (see [1, 2]) are choosing to enable CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened configs [3]. The motivation behind this is that the option can be used as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025 are mitigated by the option [4]). The feature has never been designed with performance in mind, yet common list manipulation is happening across hot paths all over the kernel. Introduce CONFIG_LIST_HARDENED, which performs list pointer checking inline, and only upon list corruption calls the reporting slow path. To generate optimal machine code with CONFIG_LIST_HARDENED: 1. Elide checking for pointer values which upon dereference would result in an immediate access fault (i.e. minimal hardening checks). The trade-off is lower-quality error reports. 2. Use the __preserve_most function attribute (available with Clang, but not yet with GCC) to minimize the code footprint for calling the reporting slow path. As a result, function size of callers is reduced by avoiding saving registers before calling the rarely called reporting slow path. Note that all TUs in lib/Makefile already disable function tracing, including list_debug.c, and __preserve_most's implied notrace has no effect in this case. 3. Because the inline checks are a subset of the full set of checks in __list_*_valid_or_report(), always return false if the inline checks failed. This avoids redundant compare and conditional branch right after return from the slow path. As a side-effect of the checks being inline, if the compiler can prove some condition to always be true, it can completely elide some checks. Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the Kconfig variables are changed to reflect that: DEBUG_LIST selects LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on DEBUG_LIST. Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with "preserve_most") shows throughput improvements, in my case of ~7% on average (up to 20-30% on some test cases). Link: https://r.android.com/1266735 [1] Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2] Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3] Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4] Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-11 23:18:40 +08:00
Enable this to turn on extended checks in the linked-list walking
routines.
This option trades better quality error reports for performance, and
is more suitable for kernel debugging. If you care about performance,
you should only enable CONFIG_LIST_HARDENED instead.
If unsure, say N.
config DEBUG_PLIST
bool "Debug priority linked list manipulation"
depends on DEBUG_KERNEL
help
Enable this to turn on extended checks in the priority-ordered
linked-list (plist) walking routines. This checks the entire
list multiple times during each manipulation.
If unsure, say N.
config DEBUG_SG
bool "Debug SG table operations"
depends on DEBUG_KERNEL
help
Enable this to turn on checks on scatter-gather tables. This can
help find problems with drivers that do not properly initialize
their sg tables.
If unsure, say N.
config DEBUG_NOTIFIERS
bool "Debug notifier call chains"
depends on DEBUG_KERNEL
help
Enable this to turn on sanity checking for notifier call chains.
This is most useful for kernel developers to make sure that
modules properly unregister themselves from notifier chains.
This is a relatively cheap check but if you care about maximum
performance, say N.
config DEBUG_CLOSURES
bool "Debug closures (bcache async widgits)"
depends on CLOSURES
select DEBUG_FS
help
Keeps all active closures in a linked list and provides a debugfs
interface to list them, which makes it possible to see asynchronous
operations that get stuck.
Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
config DEBUG_MAPLE_TREE
bool "Debug maple trees"
depends on DEBUG_KERNEL
help
Enable maple tree debugging information and extra validations.
If unsure, say N.
endmenu
source "kernel/rcu/Kconfig.debug"
config DEBUG_WQ_FORCE_RR_CPU
bool "Force round-robin CPU selection for unbound work items"
depends on DEBUG_KERNEL
default n
help
Workqueue used to implicitly guarantee that work items queued
without explicit CPU specified are put on the local CPU. This
guarantee is no longer true and while local CPU is still
preferred work items may be put on foreign CPUs. Kernel
parameter "workqueue.debug_force_rr_cpu" is added to force
round-robin CPU selection to flush out usages which depend on the
now broken guarantee. This config option enables the debug
feature by default. When enabled, memory and cache locality will
be impacted.
config CPU_HOTPLUG_STATE_CONTROL
bool "Enable CPU hotplug state control"
depends on DEBUG_KERNEL
depends on HOTPLUG_CPU
default n
help
Allows to write steps between "offline" and "online" to the CPUs
sysfs target file so states can be stepped granular. This is a debug
option for now as the hotplug machinery cannot be stopped and
restarted at arbitrary points yet.
Say N if your are unsure.
config LATENCYTOP
bool "Latency measuring infrastructure"
depends on DEBUG_KERNEL
depends on STACKTRACE_SUPPORT
depends on PROC_FS
depends on FRAME_POINTER || MIPS || PPC || S390 || MICROBLAZE || ARM || ARC || X86
select KALLSYMS
select KALLSYMS_ALL
select STACKTRACE
select SCHEDSTATS
help
Enable this option if you want to use the LatencyTOP tool
to find out which userspace is blocking on what kernel operations.
config DEBUG_CGROUP_REF
bool "Disable inlining of cgroup css reference count functions"
depends on DEBUG_KERNEL
depends on CGROUPS
depends on KPROBES
default n
help
Force cgroup css reference count functions to not be inlined so
that they can be kprobed for debugging.
source "kernel/trace/Kconfig"
config PROVIDE_OHCI1394_DMA_INIT
bool "Remote debugging over FireWire early on boot"
depends on PCI && X86
help
If you want to debug problems which hang or crash the kernel early
on boot and the crashing machine has a FireWire port, you can use
this feature to remotely access the memory of the crashed machine
over FireWire. This employs remote DMA as part of the OHCI1394
specification which is now the standard for FireWire controllers.
With remote DMA, you can monitor the printk buffer remotely using
firescope and access all memory below 4GB using fireproxy from gdb.
Even controlling a kernel debugger is possible using remote DMA.
Usage:
If ohci1394_dma=early is used as boot parameter, it will initialize
all OHCI1394 controllers which are found in the PCI config space.
As all changes to the FireWire bus such as enabling and disabling
devices cause a bus reset and thereby disable remote DMA for all
devices, be sure to have the cable plugged and FireWire enabled on
the debugging host before booting the debug target for debugging.
This code (~1k) is freed after boot. By then, the firewire stack
in charge of the OHCI-1394 controllers should be used instead.
See Documentation/core-api/debugging-via-ohci1394.rst for more information.
source "samples/Kconfig"
config ARCH_HAS_DEVMEM_IS_ALLOWED
bool
config STRICT_DEVMEM
bool "Filter access to /dev/mem"
depends on MMU && DEVMEM
depends on ARCH_HAS_DEVMEM_IS_ALLOWED || GENERIC_LIB_DEVMEM_IS_ALLOWED
default y if PPC || X86 || ARM64
help
If this option is disabled, you allow userspace (root) access to all
of memory, including kernel and userspace memory. Accidental
access to this is obviously disastrous, but specific access can
be used by people debugging the kernel. Note that with PAT support
enabled, even in this case there are restrictions on /dev/mem
use due to the cache aliasing requirements.
If this option is switched on, and IO_STRICT_DEVMEM=n, the /dev/mem
file only allows userspace access to PCI space and the BIOS code and
data regions. This is sufficient for dosemu and X and all common
users of /dev/mem.
If in doubt, say Y.
config IO_STRICT_DEVMEM
bool "Filter I/O access to /dev/mem"
depends on STRICT_DEVMEM
help
If this option is disabled, you allow userspace (root) access to all
io-memory regardless of whether a driver is actively using that
range. Accidental access to this is obviously disastrous, but
specific access can be used by people debugging kernel drivers.
If this option is switched on, the /dev/mem file only allows
userspace access to *idle* io-memory ranges (see /proc/iomem) This
may break traditional users of /dev/mem (dosemu, legacy X, etc...)
if the driver using a given range cannot be disabled.
If in doubt, say Y.
menu "$(SRCARCH) Debugging"
source "arch/$(SRCARCH)/Kconfig.debug"
endmenu
menu "Kernel Testing and Coverage"
source "lib/kunit/Kconfig"
fault-injection: notifier error injection This patchset provides kernel modules that can be used to test the error handling of notifier call chain failures by injecting artifical errors to the following notifier chain callbacks. * CPU notifier * PM notifier * memory hotplug notifier * powerpc pSeries reconfig notifier Example: Inject CPU offline error (-1 == -EPERM) # cd /sys/kernel/debug/notifier-error-inject/cpu # echo -1 > actions/CPU_DOWN_PREPARE/error # echo 0 > /sys/devices/system/cpu/cpu1/online bash: echo: write error: Operation not permitted The patchset also adds cpu and memory hotplug tests to tools/testing/selftests These tests first do simple online and offline test and then do fault injection tests if notifier error injection module is available. This patch: The notifier error injection provides the ability to inject artifical errors to specified notifier chain callbacks. It is useful to test the error handling of notifier call chain failures. This adds common basic functions to define which type of events can be fail and to initialize the debugfs interface to control what error code should be returned and which event should be failed. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 05:43:02 +08:00
config NOTIFIER_ERROR_INJECTION
tristate "Notifier error injection"
depends on DEBUG_KERNEL
select DEBUG_FS
help
This option provides the ability to inject artificial errors to
fault-injection: notifier error injection This patchset provides kernel modules that can be used to test the error handling of notifier call chain failures by injecting artifical errors to the following notifier chain callbacks. * CPU notifier * PM notifier * memory hotplug notifier * powerpc pSeries reconfig notifier Example: Inject CPU offline error (-1 == -EPERM) # cd /sys/kernel/debug/notifier-error-inject/cpu # echo -1 > actions/CPU_DOWN_PREPARE/error # echo 0 > /sys/devices/system/cpu/cpu1/online bash: echo: write error: Operation not permitted The patchset also adds cpu and memory hotplug tests to tools/testing/selftests These tests first do simple online and offline test and then do fault injection tests if notifier error injection module is available. This patch: The notifier error injection provides the ability to inject artifical errors to specified notifier chain callbacks. It is useful to test the error handling of notifier call chain failures. This adds common basic functions to define which type of events can be fail and to initialize the debugfs interface to control what error code should be returned and which event should be failed. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 05:43:02 +08:00
specified notifier chain callbacks. It is useful to test the error
handling of notifier call chain failures.
Say N if unsure.
config PM_NOTIFIER_ERROR_INJECT
tristate "PM notifier error injection module"
depends on PM && NOTIFIER_ERROR_INJECTION
default m if PM_DEBUG
help
This option provides the ability to inject artificial errors to
PM notifier chain callbacks. It is controlled through debugfs
interface /sys/kernel/debug/notifier-error-inject/pm
If the notifier call chain should be failed with some events
notified, write the error code to "actions/<notifier event>/error".
Example: Inject PM suspend error (-12 = -ENOMEM)
# cd /sys/kernel/debug/notifier-error-inject/pm/
# echo -12 > actions/PM_SUSPEND_PREPARE/error
# echo mem > /sys/power/state
bash: echo: write error: Cannot allocate memory
To compile this code as a module, choose M here: the module will
be called pm-notifier-error-inject.
If unsure, say N.
config OF_RECONFIG_NOTIFIER_ERROR_INJECT
tristate "OF reconfig notifier error injection module"
depends on OF_DYNAMIC && NOTIFIER_ERROR_INJECTION
help
This option provides the ability to inject artificial errors to
OF reconfig notifier chain callbacks. It is controlled
through debugfs interface under
/sys/kernel/debug/notifier-error-inject/OF-reconfig/
If the notifier call chain should be failed with some events
notified, write the error code to "actions/<notifier event>/error".
To compile this code as a module, choose M here: the module will
be called of-reconfig-notifier-error-inject.
If unsure, say N.
config NETDEV_NOTIFIER_ERROR_INJECT
tristate "Netdev notifier error injection module"
depends on NET && NOTIFIER_ERROR_INJECTION
help
This option provides the ability to inject artificial errors to
netdevice notifier chain callbacks. It is controlled through debugfs
interface /sys/kernel/debug/notifier-error-inject/netdev
If the notifier call chain should be failed with some events
notified, write the error code to "actions/<notifier event>/error".
Example: Inject netdevice mtu change error (-22 = -EINVAL)
# cd /sys/kernel/debug/notifier-error-inject/netdev
# echo -22 > actions/NETDEV_CHANGEMTU/error
# ip link set eth0 mtu 1024
RTNETLINK answers: Invalid argument
To compile this code as a module, choose M here: the module will
be called netdev-notifier-error-inject.
If unsure, say N.
config FUNCTION_ERROR_INJECTION
bool "Fault-injections of functions"
depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
help
Add fault injections into various functions that are annotated with
ALLOW_ERROR_INJECTION() in the kernel. BPF may also modify the return
value of these functions. This is useful to test error paths of code.
If unsure, say N
config FAULT_INJECTION
bool "Fault-injection framework"
depends on DEBUG_KERNEL
help
Provide fault-injection framework.
For more details, see Documentation/fault-injection/.
config FAILSLAB
bool "Fault-injection capability for kmalloc"
depends on FAULT_INJECTION
help
Provide fault-injection capability for kmalloc.
config FAIL_PAGE_ALLOC
bool "Fault-injection capability for alloc_pages()"
depends on FAULT_INJECTION
help
Provide fault-injection capability for alloc_pages().
lib, include/linux: add usercopy failure capability Patch series "add fault injection to user memory access", v3. The goal of this series is to improve testing of fault-tolerance in usages of user memory access functions, by adding support for fault injection. syzkaller/syzbot are using the existing fault injection modes and will use this particular feature also. The first patch adds failure injection capability for usercopy functions. The second changes usercopy functions to use this new failure capability (copy_from_user, ...). The third patch adds get/put/clear_user failures to x86. This patch (of 3): Add a failure injection capability to improve testing of fault-tolerance in usages of user memory access functions. Add CONFIG_FAULT_INJECTION_USERCOPY to enable faults in usercopy functions. The should_fail_usercopy function is to be called by these functions (copy_from_user, get_user, ...) in order to fail or not. Signed-off-by: Albert van der Linde <alinde@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200831171733.955393-1-alinde@google.com Link: http://lkml.kernel.org/r/20200831171733.955393-2-alinde@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:13:46 +08:00
config FAULT_INJECTION_USERCOPY
bool "Fault injection capability for usercopy functions"
depends on FAULT_INJECTION
help
Provides fault-injection capability to inject failures
in usercopy functions (copy_from_user(), get_user(), ...).
config FAIL_MAKE_REQUEST
bool "Fault-injection capability for disk IO"
depends on FAULT_INJECTION && BLOCK
help
Provide fault-injection capability for disk IO.
config FAIL_IO_TIMEOUT
bool "Fault-injection capability for faking disk interrupts"
depends on FAULT_INJECTION && BLOCK
help
Provide fault-injection capability on end IO handling. This
will make the block layer "forget" an interrupt as configured,
thus exercising the error handling.
Only works with drivers that use the generic timeout handling,
for others it won't do anything.
config FAIL_FUTEX
bool "Fault-injection capability for futexes"
select DEBUG_FS
depends on FAULT_INJECTION && FUTEX
help
Provide fault-injection capability for futexes.
config FAULT_INJECTION_DEBUG_FS
bool "Debugfs entries for fault-injection capabilities"
depends on FAULT_INJECTION && SYSFS && DEBUG_FS
help
Enable configuration of fault-injection capabilities via debugfs.
config FAIL_FUNCTION
bool "Fault-injection capability for functions"
depends on FAULT_INJECTION_DEBUG_FS && FUNCTION_ERROR_INJECTION
help
Provide function-based fault-injection capability.
This will allow you to override a specific function with a return
with given return value. As a result, function caller will see
an error value and have to handle it. This is useful to test the
error handling in various subsystems.
config FAIL_MMC_REQUEST
bool "Fault-injection capability for MMC IO"
depends on FAULT_INJECTION_DEBUG_FS && MMC
help
Provide fault-injection capability for MMC IO.
This will make the mmc core return data errors. This is
useful to test the error handling in the mmc block device
and to test how the mmc host driver handles retries from
the block device.
config FAIL_SUNRPC
bool "Fault-injection capability for SunRPC"
depends on FAULT_INJECTION_DEBUG_FS && SUNRPC_DEBUG
help
Provide fault-injection capability for SunRPC and
its consumers.
config FAULT_INJECTION_CONFIGFS
bool "Configfs interface for fault-injection capabilities"
depends on FAULT_INJECTION
select CONFIGFS_FS
help
This option allows configfs-based drivers to dynamically configure
fault-injection via configfs. Each parameter for driver-specific
fault-injection can be made visible as a configfs attribute in a
configfs group.
config FAULT_INJECTION_STACKTRACE_FILTER
bool "stacktrace filter for fault-injection capabilities"
depends on FAULT_INJECTION
depends on (FAULT_INJECTION_DEBUG_FS || FAULT_INJECTION_CONFIGFS) && STACKTRACE_SUPPORT
select STACKTRACE
depends on FRAME_POINTER || MIPS || PPC || S390 || MICROBLAZE || ARM || ARC || X86
help
Provide stacktrace filter for fault-injection capabilities
config ARCH_HAS_KCOV
bool
help
An architecture should select this when it can successfully
build and run with CONFIG_KCOV. This typically requires
disabling instrumentation for some early boot code.
config CC_HAS_SANCOV_TRACE_PC
def_bool $(cc-option,-fsanitize-coverage=trace-pc)
config KCOV
bool "Code coverage for fuzzing"
depends on ARCH_HAS_KCOV
depends on CC_HAS_SANCOV_TRACE_PC || GCC_PLUGINS
depends on !ARCH_WANTS_NO_INSTR || HAVE_NOINSTR_HACK || \
GCC_VERSION >= 120000 || CC_IS_CLANG
select DEBUG_FS
select GCC_PLUGIN_SANCOV if !CC_HAS_SANCOV_TRACE_PC
select OBJTOOL if HAVE_NOINSTR_HACK
help
KCOV exposes kernel code coverage information in a form suitable
for coverage-guided fuzzing (randomized testing).
For more details, see Documentation/dev-tools/kcov.rst.
config KCOV_ENABLE_COMPARISONS
bool "Enable comparison operands collection by KCOV"
depends on KCOV
depends on $(cc-option,-fsanitize-coverage=trace-cmp)
help
KCOV also exposes operands of every comparison in the instrumented
code along with operand sizes and PCs of the comparison instructions.
These operands can be used by fuzzing engines to improve the quality
of fuzzing coverage.
config KCOV_INSTRUMENT_ALL
bool "Instrument all code by default"
depends on KCOV
default y
help
If you are doing generic system call fuzzing (like e.g. syzkaller),
then you will want to instrument the whole kernel and you should
say y here. If you are doing more targeted fuzzing (like e.g.
filesystem fuzzing with AFL) then you will want to enable coverage
for more specific subsets of files, and should say n here.
kcov: collect coverage from interrupts This change extends kcov remote coverage support to allow collecting coverage from soft interrupts in addition to kernel background threads. To collect coverage from code that is executed in softirq context, a part of that code has to be annotated with kcov_remote_start/stop() in a similar way as how it is done for global kernel background threads. Then the handle used for the annotations has to be passed to the KCOV_REMOTE_ENABLE ioctl. Internally this patch adjusts the __sanitizer_cov_trace_pc() compiler inserted callback to not bail out when called from softirq context. kcov_remote_start/stop() are updated to save/restore the current per task kcov state in a per-cpu area (in case the softirq came when the kernel was already collecting coverage in task context). Coverage from softirqs is collected into pre-allocated per-cpu areas, whose size is controlled by the new CONFIG_KCOV_IRQ_AREA_SIZE. [andreyknvl@google.com: turn current->kcov_softirq into unsigned int to fix objtool warning] Link: http://lkml.kernel.org/r/841c778aa3849c5cb8c3761f56b87ce653a88671.1585233617.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/469bd385c431d050bc38a593296eff4baae50666.1584655448.git.andreyknvl@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:46:04 +08:00
config KCOV_IRQ_AREA_SIZE
hex "Size of interrupt coverage collection area in words"
depends on KCOV
default 0x40000
help
KCOV uses preallocated per-cpu areas to collect coverage from
soft interrupts. This specifies the size of those areas in the
number of unsigned long words.
menuconfig RUNTIME_TESTING_MENU
bool "Runtime Testing"
default y
if RUNTIME_TESTING_MENU
lib: add Dhrystone benchmark test When working on SoC bring-up, (a full) userspace may not be available, making it hard to benchmark the CPU performance of the system under development. Still, one may want to have a rough idea of the (relative) performance of one or more CPU cores, especially when working on e.g. the clock driver that controls the CPU core clock(s). Hence make the classical Dhrystone 2.1 benchmark available as a Linux kernel test module, based on[1]. When built-in, this benchmark can be run without any userspace present. Parallel runs (run on multiple CPU cores) are supported, just kick the "run" file multiple times. Note that the actual figures depend on the configuration options that control compiler optimization (e.g. CONFIG_CC_OPTIMIZE_FOR_SIZE vs. CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE), and on the compiler options used when building the kernel in general. Hence numbers may differ from those obtained by running similar benchmarks in userspace. [1] https://github.com/qris/dhrystone-deb.git Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lkml.kernel.org/r/4d07ad990740a5f1e426ce4566fb514f60ec9bdd.1670509558.git.geert+renesas@glider.be Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> [geert+renesas@glider.be: fix uninitialized use of ret] Link: https://lkml.kernel.org/r/alpine.DEB.2.22.394.2212190857310.137329@ramsan.of.borg Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-08 22:31:28 +08:00
config TEST_DHRY
tristate "Dhrystone benchmark test"
help
Enable this to include the Dhrystone 2.1 benchmark. This test
calculates the number of Dhrystones per second, and the number of
DMIPS (Dhrystone MIPS) obtained when the Dhrystone score is divided
by 1757 (the number of Dhrystones per second obtained on the VAX
11/780, nominally a 1 MIPS machine).
To run the benchmark, it needs to be enabled explicitly, either from
the kernel command line (when built-in), or from userspace (when
built-in or modular).
lib: add Dhrystone benchmark test When working on SoC bring-up, (a full) userspace may not be available, making it hard to benchmark the CPU performance of the system under development. Still, one may want to have a rough idea of the (relative) performance of one or more CPU cores, especially when working on e.g. the clock driver that controls the CPU core clock(s). Hence make the classical Dhrystone 2.1 benchmark available as a Linux kernel test module, based on[1]. When built-in, this benchmark can be run without any userspace present. Parallel runs (run on multiple CPU cores) are supported, just kick the "run" file multiple times. Note that the actual figures depend on the configuration options that control compiler optimization (e.g. CONFIG_CC_OPTIMIZE_FOR_SIZE vs. CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE), and on the compiler options used when building the kernel in general. Hence numbers may differ from those obtained by running similar benchmarks in userspace. [1] https://github.com/qris/dhrystone-deb.git Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lkml.kernel.org/r/4d07ad990740a5f1e426ce4566fb514f60ec9bdd.1670509558.git.geert+renesas@glider.be Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> [geert+renesas@glider.be: fix uninitialized use of ret] Link: https://lkml.kernel.org/r/alpine.DEB.2.22.394.2212190857310.137329@ramsan.of.borg Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-08 22:31:28 +08:00
Run once during kernel boot:
test_dhry.run
Set number of iterations from kernel command line:
test_dhry.iterations=<n>
Set number of iterations from userspace:
echo <n> > /sys/module/test_dhry/parameters/iterations
Trigger manual run from userspace:
echo y > /sys/module/test_dhry/parameters/run
If the number of iterations is <= 0, the test will devise a suitable
number of iterations (test runs for at least 2s) automatically.
This process takes ca. 4s.
If unsure, say N.
config LKDTM
tristate "Linux Kernel Dump Test Tool Module"
depends on DEBUG_FS
help
This module enables testing of the different dumping mechanisms by
inducing system failures at predefined crash points.
If you don't need it: say N
Choose M here to compile this code as a module. The module will be
called lkdtm.
Documentation on how to use the module can be found in
Documentation/fault-injection/provoke-crashes.rst
config CPUMASK_KUNIT_TEST
tristate "KUnit test for cpumask" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable to turn on cpumask tests, running at boot or module load time.
For more information on KUnit and unit tests in general, please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config TEST_LIST_SORT
lib/test: convert lib/test_list_sort.c to use KUnit Functionally, this just means that the test output will be slightly changed and it'll now depend on CONFIG_KUNIT=y/m. It'll still run at boot time and can still be built as a loadable module. There was a pre-existing patch to convert this test that I found later, here [1]. Compared to [1], this patch doesn't rename files and uses KUnit features more heavily (i.e. does more than converting pr_err() calls to KUNIT_FAIL()). What this conversion gives us: * a shorter test thanks to KUnit's macros * a way to run this a bit more easily via kunit.py (and CONFIG_KUNIT_ALL_TESTS=y) [2] * a structured way of reporting pass/fail * uses kunit-managed allocations to avoid the risk of memory leaks * more descriptive error messages: * i.e. it prints out which fields are invalid, what the expected values are, etc. What this conversion does not do: * change the name of the file (and thus the name of the module) * change the name of the config option Leaving these as-is for now to minimize the impact to people wanting to run this test. IMO, that concern trumps following KUnit's style guide for both names, at least for now. [1] https://lore.kernel.org/linux-kselftest/20201015014616.309000-1-vitor@massaru.org/ [2] Can be run via $ ./tools/testing/kunit/kunit.py run --kunitconfig /dev/stdin <<EOF CONFIG_KUNIT=y CONFIG_TEST_LIST_SORT=y EOF [16:55:56] Configuring KUnit Kernel ... [16:55:56] Building KUnit Kernel ... [16:56:29] Starting KUnit Kernel ... [16:56:32] ============================================================ [16:56:32] ======== [PASSED] list_sort ======== [16:56:32] [PASSED] list_sort_test [16:56:32] ============================================================ [16:56:32] Testing complete. 1 tests run. 0 failed. 0 crashed. [16:56:32] Elapsed time: 35.668s total, 0.001s configuring, 32.725s building, 0.000s running Note: the build time is as after a `make mrproper`. Signed-off-by: Daniel Latypov <dlatypov@google.com> Tested-by: David Gow <davidgow@google.com> Acked-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2021-05-04 04:58:35 +08:00
tristate "Linked list sorting test" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable this to turn on 'list_sort()' function test. This test is
executed only once during system boot (so affects only boot time),
or at module load time.
If unsure, say N.
config TEST_MIN_HEAP
tristate "Min heap test"
depends on DEBUG_KERNEL || m
help
Enable this to turn on min heap function tests. This test is
executed only once during system boot (so affects only boot time),
or at module load time.
If unsure, say N.
config TEST_SORT
lib/test: convert test_sort.c to use KUnit This follows up commit ebd09577be6c ("lib/test: convert lib/test_list_sort.c to use KUnit"). Converting this test to KUnit makes the test a bit shorter, standardizes how it reports pass/fail, and adds an easier way to run the test [1]. Like ebd09577be6c, this leaves the file and Kconfig option name the same, but slightly changes their dependencies (needs CONFIG_KUNIT). [1] Can be run via $ ./tools/testing/kunit/kunit.py run --kunitconfig /dev/stdin <<EOF CONFIG_KUNIT=y CONFIG_TEST_SORT=y EOF [11:30:27] Starting KUnit Kernel ... [11:30:30] ============================================================ [11:30:30] ======== [PASSED] lib_sort ======== [11:30:30] [PASSED] test_sort [11:30:30] ============================================================ [11:30:30] Testing complete. 1 tests run. 0 failed. 0 crashed. 0 skipped. [11:30:30] Elapsed time: 37.032s total, 0.001s configuring, 34.090s building, 0.000s running Note: this is the time it took after a `make mrproper`. With an incremental rebuild, this looks more like: [11:38:58] Elapsed time: 6.444s total, 0.001s configuring, 3.416s building, 0.000s running Since the test has no dependencies, it can also be run (with some other tests) with just: $ ./tools/testing/kunit/kunit.py run Link: https://lkml.kernel.org/r/20210715232441.1380885-1-dlatypov@google.com Signed-off-by: Daniel Latypov <dlatypov@google.com> Cc: Pravin Shedge <pravin.shedge4linux@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 10:58:48 +08:00
tristate "Array-based sort test" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This option enables the self-test function of 'sort()' at boot,
or at module load time.
If unsure, say N.
config TEST_DIV64
tristate "64bit/32bit division and modulo test"
depends on DEBUG_KERNEL || m
help
Enable this to turn on 'do_div()' function test. This test is
executed only once during system boot (so affects only boot time),
or at module load time.
If unsure, say N.
config TEST_IOV_ITER
tristate "Test iov_iter operation" if !KUNIT_ALL_TESTS
depends on KUNIT
depends on MMU
default KUNIT_ALL_TESTS
help
Enable this to turn on testing of the operation of the I/O iterator
(iov_iter). This test is executed only once during system boot (so
affects only boot time), or at module load time.
If unsure, say N.
config KPROBES_SANITY_TEST
tristate "Kprobes sanity tests" if !KUNIT_ALL_TESTS
depends on DEBUG_KERNEL
depends on KPROBES
depends on KUNIT
select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
default KUNIT_ALL_TESTS
help
This option provides for testing basic kprobes functionality on
boot. Samples of kprobe and kretprobe are inserted and
verified for functionality.
Say N if you are unsure.
config FPROBE_SANITY_TEST
bool "Self test for fprobe"
depends on DEBUG_KERNEL
depends on FPROBE
depends on KUNIT=y
help
This option will enable testing the fprobe when the system boot.
A series of tests are made to verify that the fprobe is functioning
properly.
Say N if you are unsure.
config BACKTRACE_SELF_TEST
tristate "Self test for the backtrace code"
depends on DEBUG_KERNEL
help
This option provides a kernel module that can be used to test
the kernel stack backtrace code. This option is not useful
for distributions or general kernels, but only for kernel
developers working on architecture code.
Note that if you want to also test saved backtraces, you will
have to enable STACKTRACE as well.
Say N if you are unsure.
lib: add tests for reference tracker This module uses reference tracker, forcing two issues. 1) Double free of a tracker 2) leak of two trackers, one being allocated from softirq context. "modprobe test_ref_tracker" would emit the following traces. (Use scripts/decode_stacktrace.sh if necessary) [ 171.648681] reference already released. [ 171.653213] allocated in: [ 171.656523] alloctest_ref_tracker_alloc2+0x1c/0x20 [test_ref_tracker] [ 171.656526] init_module+0x86/0x1000 [test_ref_tracker] [ 171.656528] do_one_initcall+0x9c/0x220 [ 171.656532] do_init_module+0x60/0x240 [ 171.656536] load_module+0x32b5/0x3610 [ 171.656538] __do_sys_init_module+0x148/0x1a0 [ 171.656540] __x64_sys_init_module+0x1d/0x20 [ 171.656542] do_syscall_64+0x4a/0xb0 [ 171.656546] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.656549] freed in: [ 171.659520] alloctest_ref_tracker_free+0x13/0x20 [test_ref_tracker] [ 171.659522] init_module+0xec/0x1000 [test_ref_tracker] [ 171.659523] do_one_initcall+0x9c/0x220 [ 171.659525] do_init_module+0x60/0x240 [ 171.659527] load_module+0x32b5/0x3610 [ 171.659529] __do_sys_init_module+0x148/0x1a0 [ 171.659532] __x64_sys_init_module+0x1d/0x20 [ 171.659534] do_syscall_64+0x4a/0xb0 [ 171.659536] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.659575] ------------[ cut here ]------------ [ 171.659576] WARNING: CPU: 5 PID: 13016 at lib/ref_tracker.c:112 ref_tracker_free+0x224/0x270 [ 171.659581] Modules linked in: test_ref_tracker(+) [ 171.659591] CPU: 5 PID: 13016 Comm: modprobe Tainted: G S 5.16.0-smp-DEV #290 [ 171.659595] RIP: 0010:ref_tracker_free+0x224/0x270 [ 171.659599] Code: 5e 41 5f 5d c3 48 c7 c7 04 9c 74 a6 31 c0 e8 62 ee 67 00 83 7b 14 00 75 1a 83 7b 18 00 75 30 4c 89 ff 4c 89 f6 e8 9c 00 69 00 <0f> 0b bb ea ff ff ff eb ae 48 c7 c7 3a 0a 77 a6 31 c0 e8 34 ee 67 [ 171.659601] RSP: 0018:ffff89058ba0bbd0 EFLAGS: 00010286 [ 171.659603] RAX: 0000000000000029 RBX: ffff890586b19780 RCX: 08895bff57c7d100 [ 171.659604] RDX: c0000000ffff7fff RSI: 0000000000000282 RDI: ffffffffc0407000 [ 171.659606] RBP: ffff89058ba0bc88 R08: 0000000000000000 R09: ffffffffa6f342e0 [ 171.659607] R10: 00000000ffff7fff R11: 0000000000000000 R12: 000000008f000000 [ 171.659608] R13: 0000000000000014 R14: 0000000000000282 R15: ffffffffc0407000 [ 171.659609] FS: 00007f97ea29d740(0000) GS:ffff8923ff940000(0000) knlGS:0000000000000000 [ 171.659611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 171.659613] CR2: 00007f97ea299000 CR3: 0000000186b4a004 CR4: 00000000001706e0 [ 171.659614] Call Trace: [ 171.659615] <TASK> [ 171.659631] ? alloctest_ref_tracker_free+0x13/0x20 [test_ref_tracker] [ 171.659633] ? init_module+0x105/0x1000 [test_ref_tracker] [ 171.659636] ? do_one_initcall+0x9c/0x220 [ 171.659638] ? do_init_module+0x60/0x240 [ 171.659641] ? load_module+0x32b5/0x3610 [ 171.659644] ? __do_sys_init_module+0x148/0x1a0 [ 171.659646] ? __x64_sys_init_module+0x1d/0x20 [ 171.659649] ? do_syscall_64+0x4a/0xb0 [ 171.659652] ? entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.659656] ? 0xffffffffc040a000 [ 171.659658] alloctest_ref_tracker_free+0x13/0x20 [test_ref_tracker] [ 171.659660] init_module+0x105/0x1000 [test_ref_tracker] [ 171.659663] do_one_initcall+0x9c/0x220 [ 171.659666] do_init_module+0x60/0x240 [ 171.659669] load_module+0x32b5/0x3610 [ 171.659672] __do_sys_init_module+0x148/0x1a0 [ 171.659676] __x64_sys_init_module+0x1d/0x20 [ 171.659678] do_syscall_64+0x4a/0xb0 [ 171.659694] ? exc_page_fault+0x6e/0x140 [ 171.659696] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.659698] RIP: 0033:0x7f97ea3dbe7a [ 171.659700] Code: 48 8b 0d 61 8d 06 00 f7 d8 64 89 01 48 83 c8 ff c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2e 8d 06 00 f7 d8 64 89 01 48 [ 171.659701] RSP: 002b:00007ffea67ce608 EFLAGS: 00000246 ORIG_RAX: 00000000000000af [ 171.659703] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f97ea3dbe7a [ 171.659704] RDX: 00000000013a0ba0 RSI: 0000000000002808 RDI: 00007f97ea299000 [ 171.659705] RBP: 00007ffea67ce670 R08: 0000000000000003 R09: 0000000000000000 [ 171.659706] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000013a1048 [ 171.659707] R13: 00000000013a0ba0 R14: 0000000001399930 R15: 00000000013a1030 [ 171.659709] </TASK> [ 171.659710] ---[ end trace f5dbd6afa41e60a9 ]--- [ 171.659712] leaked reference. [ 171.663393] alloctest_ref_tracker_alloc0+0x1c/0x20 [test_ref_tracker] [ 171.663395] test_ref_tracker_timer_func+0x9/0x20 [test_ref_tracker] [ 171.663397] call_timer_fn+0x31/0x140 [ 171.663401] expire_timers+0x46/0x110 [ 171.663403] __run_timers+0x16f/0x1b0 [ 171.663404] run_timer_softirq+0x1d/0x40 [ 171.663406] __do_softirq+0x148/0x2d3 [ 171.663408] leaked reference. [ 171.667101] alloctest_ref_tracker_alloc1+0x1c/0x20 [test_ref_tracker] [ 171.667103] init_module+0x81/0x1000 [test_ref_tracker] [ 171.667104] do_one_initcall+0x9c/0x220 [ 171.667106] do_init_module+0x60/0x240 [ 171.667108] load_module+0x32b5/0x3610 [ 171.667111] __do_sys_init_module+0x148/0x1a0 [ 171.667113] __x64_sys_init_module+0x1d/0x20 [ 171.667115] do_syscall_64+0x4a/0xb0 [ 171.667117] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.667131] ------------[ cut here ]------------ [ 171.667132] WARNING: CPU: 5 PID: 13016 at lib/ref_tracker.c:30 ref_tracker_dir_exit+0x104/0x130 [ 171.667136] Modules linked in: test_ref_tracker(+) [ 171.667144] CPU: 5 PID: 13016 Comm: modprobe Tainted: G S W 5.16.0-smp-DEV #290 [ 171.667147] RIP: 0010:ref_tracker_dir_exit+0x104/0x130 [ 171.667150] Code: 01 00 00 00 00 ad de 48 89 03 4c 89 63 08 48 89 df e8 20 a0 d5 ff 4c 89 f3 4d 39 ee 75 a8 4c 89 ff 48 8b 75 d0 e8 7c 05 69 00 <0f> 0b eb 0c 4c 89 ff 48 8b 75 d0 e8 6c 05 69 00 41 8b 47 08 83 f8 [ 171.667151] RSP: 0018:ffff89058ba0bc68 EFLAGS: 00010286 [ 171.667154] RAX: 08895bff57c7d100 RBX: ffffffffc0407010 RCX: 000000000000003b [ 171.667156] RDX: 000000000000003c RSI: 0000000000000282 RDI: ffffffffc0407000 [ 171.667157] RBP: ffff89058ba0bc98 R08: 0000000000000000 R09: ffffffffa6f342e0 [ 171.667159] R10: 00000000ffff7fff R11: 0000000000000000 R12: dead000000000122 [ 171.667160] R13: ffffffffc0407010 R14: ffffffffc0407010 R15: ffffffffc0407000 [ 171.667162] FS: 00007f97ea29d740(0000) GS:ffff8923ff940000(0000) knlGS:0000000000000000 [ 171.667164] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 171.667166] CR2: 00007f97ea299000 CR3: 0000000186b4a004 CR4: 00000000001706e0 [ 171.667169] Call Trace: [ 171.667170] <TASK> [ 171.667171] ? 0xffffffffc040a000 [ 171.667173] init_module+0x126/0x1000 [test_ref_tracker] [ 171.667175] do_one_initcall+0x9c/0x220 [ 171.667179] do_init_module+0x60/0x240 [ 171.667182] load_module+0x32b5/0x3610 [ 171.667186] __do_sys_init_module+0x148/0x1a0 [ 171.667189] __x64_sys_init_module+0x1d/0x20 [ 171.667192] do_syscall_64+0x4a/0xb0 [ 171.667194] ? exc_page_fault+0x6e/0x140 [ 171.667196] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 171.667199] RIP: 0033:0x7f97ea3dbe7a [ 171.667200] Code: 48 8b 0d 61 8d 06 00 f7 d8 64 89 01 48 83 c8 ff c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2e 8d 06 00 f7 d8 64 89 01 48 [ 171.667201] RSP: 002b:00007ffea67ce608 EFLAGS: 00000246 ORIG_RAX: 00000000000000af [ 171.667203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f97ea3dbe7a [ 171.667204] RDX: 00000000013a0ba0 RSI: 0000000000002808 RDI: 00007f97ea299000 [ 171.667205] RBP: 00007ffea67ce670 R08: 0000000000000003 R09: 0000000000000000 [ 171.667206] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000013a1048 [ 171.667207] R13: 00000000013a0ba0 R14: 0000000001399930 R15: 00000000013a1030 [ 171.667209] </TASK> [ 171.667210] ---[ end trace f5dbd6afa41e60aa ]--- Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-05 12:21:56 +08:00
config TEST_REF_TRACKER
tristate "Self test for reference tracker"
depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
select REF_TRACKER
help
This option provides a kernel module performing tests
using reference tracker infrastructure.
Say N if you are unsure.
config RBTREE_TEST
tristate "Red-Black tree test"
depends on DEBUG_KERNEL
help
A benchmark measuring the performance of the rbtree library.
Also includes rbtree invariant checks.
rslib: Add tests for the encoder and decoder A Reed-Solomon code with minimum distance d can correct any error and erasure pattern that satisfies 2 * #error + #erasures < d. If the error correction capacity is exceeded, then correct decoding cannot be guaranteed. The decoder must, however, return a valid codeword or report failure. There are two main tests: - Check for correct behaviour up to the error correction capacity - Check for correct behaviour beyond error corrupted capacity Both tests are simple: 1. Generate random data 2. Encode data with the chosen code 3. Add errors and erasures to data 4. Decode the corrupted word 5. Check for correct behaviour When testing up to capacity we test for: - Correct decoding - Correct return value (i.e. the number of corrected symbols) - That the returned error positions are correct There are two kinds of erasures; the erased symbol can be corrupted or not. When counting the number of corrected symbols, erasures without symbol corruption should not be counted. Similarly, the returned error positions should only include positions where a correction is necessary. We run the up to capacity tests for three different interfaces of decode_rs: - Use the correction buffers - Use the correction buffers with syndromes provided by the caller - Error correction in place (does not check the error positions) When testing beyond capacity test for silent failures. A silent failure is when the decoder returns success but the returned word is not a valid codeword. There are a couple of options for the tests: - Verbosity. - Whether to test for correct behaviour beyond capacity. Default is to test beyond capacity. - Whether to allow erasures without symbol corruption. Defaults to yes. Note that the tests take a couple of minutes to complete. Signed-off-by: Ferdinand Blomqvist <ferdinand.blomqvist@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190620141039.9874-2-ferdinand.blomqvist@gmail.com
2019-06-20 22:10:33 +08:00
config REED_SOLOMON_TEST
tristate "Reed-Solomon library test"
depends on DEBUG_KERNEL || m
select REED_SOLOMON
select REED_SOLOMON_ENC16
select REED_SOLOMON_DEC16
help
This option enables the self-test function of rslib at boot,
or at module load time.
If unsure, say N.
rbtree: add prio tree and interval tree tests Patch 1 implements support for interval trees, on top of the augmented rbtree API. It also adds synthetic tests to compare the performance of interval trees vs prio trees. Short answers is that interval trees are slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x) on search. It is debatable how realistic the synthetic test is, and I have not made such measurements yet, but my impression is that interval trees would still come out faster. Patch 2 uses a preprocessor template to make the interval tree generic, and uses it as a replacement for the vma prio_tree. Patch 3 takes the other prio_tree user, kmemleak, and converts it to use a basic rbtree. We don't actually need the augmented rbtree support here because the intervals are always non-overlapping. Patch 4 removes the now-unused prio tree library. Patch 5 proposes an additional optimization to rb_erase_augmented, now providing it as an inline function so that the augmented callbacks can be inlined in. This provides an additional 5-10% performance improvement for the interval tree insert/erase benchmark. There is a maintainance cost as it exposes augmented rbtree users to some of the rbtree library internals; however I think this cost shouldn't be too high as I expect the augmented rbtree will always have much less users than the base rbtree. I should probably add a quick summary of why I think it makes sense to replace prio trees with augmented rbtree based interval trees now. One of the drivers is that we need augmented rbtrees for Rik's vma gap finding code, and once you have them, it just makes sense to use them for interval trees as well, as this is the simpler and more well known algorithm. prio trees, in comparison, seem *too* clever: they impose an additional 'heap' constraint on the tree, which they use to guarantee a faster worst-case complexity of O(k+log N) for stabbing queries in a well-balanced prio tree, vs O(k*log N) for interval trees (where k=number of matches, N=number of intervals). Now this sounds great, but in practice prio trees don't realize this theorical benefit. First, the additional constraint makes them harder to update, so that the kernel implementation has to simplify things by balancing them like a radix tree, which is not always ideal. Second, the fact that there are both index and heap properties makes both tree manipulation and search more complex, which results in a higher multiplicative time constant. As it turns out, the simple interval tree algorithm ends up running faster than the more clever prio tree. This patch: Add two test modules: - prio_tree_test measures the performance of lib/prio_tree.c, both for insertion/removal and for stabbing searches - interval_tree_test measures the performance of a library of equivalent functionality, built using the augmented rbtree support. In order to support the second test module, lib/interval_tree.c is introduced. It is kept separate from the interval_tree_test main file for two reasons: first we don't want to provide an unfair advantage over prio_tree_test by having everything in a single compilation unit, and second there is the possibility that the interval tree functionality could get some non-test users in kernel over time. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 07:31:23 +08:00
config INTERVAL_TREE_TEST
tristate "Interval tree test"
depends on DEBUG_KERNEL
select INTERVAL_TREE
rbtree: add prio tree and interval tree tests Patch 1 implements support for interval trees, on top of the augmented rbtree API. It also adds synthetic tests to compare the performance of interval trees vs prio trees. Short answers is that interval trees are slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x) on search. It is debatable how realistic the synthetic test is, and I have not made such measurements yet, but my impression is that interval trees would still come out faster. Patch 2 uses a preprocessor template to make the interval tree generic, and uses it as a replacement for the vma prio_tree. Patch 3 takes the other prio_tree user, kmemleak, and converts it to use a basic rbtree. We don't actually need the augmented rbtree support here because the intervals are always non-overlapping. Patch 4 removes the now-unused prio tree library. Patch 5 proposes an additional optimization to rb_erase_augmented, now providing it as an inline function so that the augmented callbacks can be inlined in. This provides an additional 5-10% performance improvement for the interval tree insert/erase benchmark. There is a maintainance cost as it exposes augmented rbtree users to some of the rbtree library internals; however I think this cost shouldn't be too high as I expect the augmented rbtree will always have much less users than the base rbtree. I should probably add a quick summary of why I think it makes sense to replace prio trees with augmented rbtree based interval trees now. One of the drivers is that we need augmented rbtrees for Rik's vma gap finding code, and once you have them, it just makes sense to use them for interval trees as well, as this is the simpler and more well known algorithm. prio trees, in comparison, seem *too* clever: they impose an additional 'heap' constraint on the tree, which they use to guarantee a faster worst-case complexity of O(k+log N) for stabbing queries in a well-balanced prio tree, vs O(k*log N) for interval trees (where k=number of matches, N=number of intervals). Now this sounds great, but in practice prio trees don't realize this theorical benefit. First, the additional constraint makes them harder to update, so that the kernel implementation has to simplify things by balancing them like a radix tree, which is not always ideal. Second, the fact that there are both index and heap properties makes both tree manipulation and search more complex, which results in a higher multiplicative time constant. As it turns out, the simple interval tree algorithm ends up running faster than the more clever prio tree. This patch: Add two test modules: - prio_tree_test measures the performance of lib/prio_tree.c, both for insertion/removal and for stabbing searches - interval_tree_test measures the performance of a library of equivalent functionality, built using the augmented rbtree support. In order to support the second test module, lib/interval_tree.c is introduced. It is kept separate from the interval_tree_test main file for two reasons: first we don't want to provide an unfair advantage over prio_tree_test by having everything in a single compilation unit, and second there is the possibility that the interval tree functionality could get some non-test users in kernel over time. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 07:31:23 +08:00
help
A benchmark measuring the performance of the interval tree library
config PERCPU_TEST
tristate "Per cpu operations test"
depends on m && DEBUG_KERNEL
help
Enable this option to build test module which validates per-cpu
operations.
If unsure, say N.
config ATOMIC64_SELFTEST
tristate "Perform an atomic64_t self-test"
help
Enable this option to test the atomic64_t functions at boot or
at module load time.
If unsure, say N.
config ASYNC_RAID6_TEST
tristate "Self test for hardware accelerated raid6 recovery"
depends on ASYNC_RAID6_RECOV
select ASYNC_MEMCPY
help
This is a one-shot self test that permutes through the
recovery of all the possible two disk failure scenarios for a
N-disk array. Recovery is performed with the asynchronous
raid6 recovery routines, and will optionally use an offload
engine if one is available.
If unsure, say N.
config TEST_HEXDUMP
tristate "Test functions located in the hexdump module at runtime"
config STRING_KUNIT_TEST
tristate "KUnit test string functions at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
config STRING_HELPERS_KUNIT_TEST
tristate "KUnit test string helpers at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
config TEST_KSTRTOX
tristate "Test kstrto*() family of functions at runtime"
config TEST_PRINTF
tristate "Test printf() family of functions at runtime"
config TEST_SCANF
tristate "Test scanf() family of functions at runtime"
config TEST_BITMAP
tristate "Test bitmap_*() family of functions at runtime"
help
Enable this option to test the bitmap functions at boot.
If unsure, say N.
config TEST_UUID
tristate "Test functions located in the uuid module at runtime"
config TEST_XARRAY
tristate "Test the XArray code at runtime"
config TEST_MAPLE_TREE
tristate "Test the Maple Tree code at runtime or module load"
help
Enable this option to test the maple tree code functions at boot, or
when the module is loaded. Enable "Debug Maple Trees" will enable
more verbose output on failures.
If unsure, say N.
config TEST_RHASHTABLE
tristate "Perform selftest on resizable hash table"
help
Enable this option to test the rhashtable functions at boot.
If unsure, say N.
config TEST_IDA
tristate "Perform selftest on IDA functions"
config TEST_PARMAN
tristate "Perform selftest on priority array manager"
depends on PARMAN
help
Enable this option to test priority array manager on boot
(or module load).
If unsure, say N.
config TEST_IRQ_TIMINGS
bool "IRQ timings selftest"
depends on IRQ_TIMINGS
help
Enable this option to test the irq timings code on boot.
If unsure, say N.
config TEST_LKM
test: add minimal module for verification testing This is a pair of test modules I'd like to see in the tree. Instead of putting these in lkdtm, where I've been adding various tests that trigger crashes, these don't make sense there since they need to be either distinctly separate, or their pass/fail state don't need to crash the machine. These live in lib/ for now, along with a few other in-kernel test modules, and use the slightly more common "test_" naming convention, instead of "test-". We should likely standardize on the former: $ find . -name 'test_*.c' | grep -v /tools/ | wc -l 4 $ find . -name 'test-*.c' | grep -v /tools/ | wc -l 2 The first is entirely a no-op module, designed to allow simple testing of the module loading and verification interface. It's useful to have a module that has no other uses or dependencies so it can be reliably used for just testing module loading and verification. The second is a module that exercises the user memory access functions, in an effort to make sure that we can quickly catch any regressions in boundary checking (e.g. like what was recently fixed on ARM). This patch (of 2): When doing module loading verification tests (for example, with module signing, or LSM hooks), it is very handy to have a module that can be built on all systems under test, isn't auto-loaded at boot, and has no device or similar dependencies. This creates the "test_module.ko" module for that purpose, which only reports its load and unload to printk. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-24 07:54:37 +08:00
tristate "Test module loading with 'hello world' module"
depends on m
help
This builds the "test_module" module that emits "Hello, world"
on printk when loaded. It is designed to be used for basic
evaluation of the module loading subsystem (for example when
validating module verification). It lacks any extra dependencies,
and will not normally be loaded by the system unless explicitly
requested by name.
If unsure, say N.
lib: make a test module with set/clear bit Test some bit clears/sets to make sure assembly doesn't change, and that the set_bit and clear_bit functions work and don't cause sparse warnings. Instruct Kbuild to build this file with extra warning level -Wextra, to catch new issues, and also doesn't hurt to build with C=1. This was used to test changes to arch/x86/include/asm/bitops.h. In particular, sparse (C=1) was very concerned when the last bit before a natural boundary, like 7, or 31, was being tested, as this causes sign extension (0xffffff7f) for instance when clearing bit 7. Recommended usage: make defconfig scripts/config -m CONFIG_TEST_BITOPS make modules_prepare make C=1 W=1 lib/test_bitops.ko objdump -S -d lib/test_bitops.ko insmod lib/test_bitops.ko rmmod lib/test_bitops.ko <check dmesg>, there should be no compiler/sparse warnings and no error messages in log. Link: http://lkml.kernel.org/r/20200310221747.2848474-2-jesse.brandeburg@intel.com Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> CcL Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:27 +08:00
config TEST_BITOPS
tristate "Test module for compilation of bitops operations"
lib: make a test module with set/clear bit Test some bit clears/sets to make sure assembly doesn't change, and that the set_bit and clear_bit functions work and don't cause sparse warnings. Instruct Kbuild to build this file with extra warning level -Wextra, to catch new issues, and also doesn't hurt to build with C=1. This was used to test changes to arch/x86/include/asm/bitops.h. In particular, sparse (C=1) was very concerned when the last bit before a natural boundary, like 7, or 31, was being tested, as this causes sign extension (0xffffff7f) for instance when clearing bit 7. Recommended usage: make defconfig scripts/config -m CONFIG_TEST_BITOPS make modules_prepare make C=1 W=1 lib/test_bitops.ko objdump -S -d lib/test_bitops.ko insmod lib/test_bitops.ko rmmod lib/test_bitops.ko <check dmesg>, there should be no compiler/sparse warnings and no error messages in log. Link: http://lkml.kernel.org/r/20200310221747.2848474-2-jesse.brandeburg@intel.com Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> CcL Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:27 +08:00
help
This builds the "test_bitops" module that is much like the
TEST_LKM module except that it does a basic exercise of the
set/clear_bit macros and get_count_order/long to make sure there are
no compiler warnings from C=1 sparse checker or -Wextra
compilations. It has no dependencies and doesn't run or load unless
explicitly requested by name. for example: modprobe test_bitops.
lib: make a test module with set/clear bit Test some bit clears/sets to make sure assembly doesn't change, and that the set_bit and clear_bit functions work and don't cause sparse warnings. Instruct Kbuild to build this file with extra warning level -Wextra, to catch new issues, and also doesn't hurt to build with C=1. This was used to test changes to arch/x86/include/asm/bitops.h. In particular, sparse (C=1) was very concerned when the last bit before a natural boundary, like 7, or 31, was being tested, as this causes sign extension (0xffffff7f) for instance when clearing bit 7. Recommended usage: make defconfig scripts/config -m CONFIG_TEST_BITOPS make modules_prepare make C=1 W=1 lib/test_bitops.ko objdump -S -d lib/test_bitops.ko insmod lib/test_bitops.ko rmmod lib/test_bitops.ko <check dmesg>, there should be no compiler/sparse warnings and no error messages in log. Link: http://lkml.kernel.org/r/20200310221747.2848474-2-jesse.brandeburg@intel.com Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> CcL Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:27 +08:00
If unsure, say N.
vmalloc: add test driver to analyse vmalloc allocator This adds a new kernel module for analysis of vmalloc allocator. It is only enabled as a module. There are two main reasons this module should be used for: performance evaluation and stressing of vmalloc subsystem. It consists of several test cases. As of now there are 8. The module has five parameters we can specify to change its the behaviour. 1) run_test_mask - set of tests to be run id: 1, name: fix_size_alloc_test id: 2, name: full_fit_alloc_test id: 4, name: long_busy_list_alloc_test id: 8, name: random_size_alloc_test id: 16, name: fix_align_alloc_test id: 32, name: random_size_align_alloc_test id: 64, name: align_shift_alloc_test id: 128, name: pcpu_alloc_test By default all tests are in run test mask. If you want to select some specific tests it is possible to pass the mask. For example for first, second and fourth tests we go 11 value. 2) test_repeat_count - how many times each test should be repeated By default it is one time per test. It is possible to pass any number. As high the value is the test duration gets increased. 3) test_loop_count - internal test loop counter. By default it is set to 1000000. 4) single_cpu_test - use one CPU to run the tests By default this parameter is set to false. It means that all online CPUs execute tests. By setting it to 1, the tests are executed by first online CPU only. 5) sequential_test_order - run tests in sequential order By default this parameter is set to false. It means that before running tests the order is shuffled. It is possible to make it sequential, just set it to 1. Performance analysis: In order to evaluate performance of vmalloc allocations, usually it makes sense to use only one CPU that runs tests, use sequential order, number of repeat tests can be different as well as set of test mask. For example if we want to run all tests, to use one CPU and repeat each test 3 times. Insert the module passing following parameters: single_cpu_test=1 sequential_test_order=1 test_repeat_count=3 with following output: <snip> Summary: fix_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 901177 usec Summary: full_fit_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 1039341 usec Summary: long_busy_list_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 11775763 usec Summary: random_size_alloc_test passed 3: failed: 0 repeat: 3 loops: 1000000 avg: 6081992 usec Summary: fix_align_alloc_test passed: 3 failed: 0 repeat: 3, loops: 1000000 avg: 2003712 usec Summary: random_size_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2895689 usec Summary: align_shift_alloc_test passed: 0 failed: 3 repeat: 3 loops: 1000000 avg: 573 usec Summary: pcpu_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 95802 usec All test took CPU0=192945605995 cycles <snip> The align_shift_alloc_test is expected to be failed. Stressing: In order to stress the vmalloc subsystem we run all available test cases on all available CPUs simultaneously. In order to prevent constant behaviour pattern, the test cases array is shuffled by default to randomize the order of test execution. For example if we want to run all tests(default), use all online CPUs(default) with shuffled order(default) and to repeat each test 30 times. The command would be like: modprobe vmalloc_test test_repeat_count=30 Expected results are the system is alive, there are no any BUG_ONs or Kernel Panics the tests are completed, no memory leaks. [urezki@gmail.com: fix 32-bit builds] Link: http://lkml.kernel.org/r/20190106214839.ffvjvmrn52uqog7k@pc636 [urezki@gmail.com: make CONFIG_TEST_VMALLOC depend on CONFIG_MMU] Link: http://lkml.kernel.org/r/20190219085441.s6bg2gpy4esny5vw@pc636 Link: http://lkml.kernel.org/r/20190103142108.20744-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:43:34 +08:00
config TEST_VMALLOC
tristate "Test module for stress/performance analysis of vmalloc allocator"
default n
depends on MMU
depends on m
help
This builds the "test_vmalloc" module that should be used for
stress and performance analysis. So, any new change for vmalloc
subsystem can be evaluated from performance and stability point
of view.
If unsure, say N.
config TEST_BPF
tristate "Test BPF filter functionality"
depends on m && NET
help
This builds the "test_bpf" module that runs various test vectors
against the BPF interpreter or BPF JIT compiler depending on the
current setting. This is in particular useful for BPF JIT compiler
development, but also to run regression tests against changes in
the interpreter code. It also enables test stubs for eBPF maps and
verifier used by user space verifier testsuite.
If unsure, say N.
config TEST_BLACKHOLE_DEV
tristate "Test blackhole netdev functionality"
depends on m && NET
help
This builds the "test_blackhole_dev" module that validates the
data path through this blackhole netdev.
If unsure, say N.
config FIND_BIT_BENCHMARK
tristate "Test find_bit functions"
help
This builds the "test_find_bit" module that measure find_*_bit()
functions performance.
If unsure, say N.
config TEST_FIRMWARE
tristate "Test firmware loading via userspace interface"
depends on FW_LOADER
help
This builds the "test_firmware" module that creates a userspace
interface for testing firmware loading. This can be used to
control the triggering of firmware loading without needing an
actual firmware-using device. The contents can be rechecked by
userspace.
If unsure, say N.
config TEST_SYSCTL
tristate "sysctl test driver"
depends on PROC_SYSCTL
help
This builds the "test_sysctl" module. This driver enables to test the
proc sysctl interfaces available to drivers safely without affecting
production knobs which might alter system functionality.
If unsure, say N.
config BITFIELD_KUNIT
tristate "KUnit test bitfield functions at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable this option to test the bitfield functions at boot.
KUnit tests run during boot and output the results to the debug log
in TAP format (http://testanything.org/). Only useful for kernel devs
running the KUnit test harness, and not intended for inclusion into a
production build.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
x86/csum: Improve performance of `csum_partial` 1) Add special case for len == 40 as that is the hottest value. The nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case. 2) Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations). Results from benchmarking on Icelake. Times measured with rdtsc() len lat_new lat_old r tput_new tput_old r 8 3.58 3.47 1.032 3.58 3.51 1.021 16 4.14 4.02 1.028 3.96 3.78 1.046 24 4.99 5.03 0.992 4.23 4.03 1.050 32 5.09 5.08 1.001 4.68 4.47 1.048 40 5.57 6.08 0.916 3.05 4.43 0.690 48 6.65 6.63 1.003 4.97 4.69 1.059 56 7.74 7.72 1.003 5.22 4.95 1.055 64 6.65 7.22 0.921 6.38 6.42 0.994 96 9.43 9.96 0.946 7.46 7.54 0.990 128 9.39 12.15 0.773 8.90 8.79 1.012 200 12.65 18.08 0.699 11.63 11.60 1.002 272 15.82 23.37 0.677 14.43 14.35 1.005 440 24.12 36.43 0.662 21.57 22.69 0.951 952 46.20 74.01 0.624 42.98 53.12 0.809 1024 47.12 78.24 0.602 46.36 58.83 0.788 1552 72.01 117.30 0.614 71.92 96.78 0.743 2048 93.07 153.25 0.607 93.28 137.20 0.680 2600 114.73 194.30 0.590 114.28 179.32 0.637 3608 156.34 268.41 0.582 154.97 254.02 0.610 4096 175.01 304.03 0.576 175.89 292.08 0.602 There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to be ~5% throughput and slightly less in terms of latency. Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases. 1) Precomputed random inputs/seed. The expected results where generated use the generic implementation (which is assumed to be non-buggy). 2) An input of all 1s. The goal of this test is to catch any case a carry is missing. 3) An input that never carries. The goal of this test si to catch any case of incorrectly carrying. More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction The reposity also has the code for reproducing the above benchmark numbers. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
2023-05-11 09:10:02 +08:00
config CHECKSUM_KUNIT
tristate "KUnit test checksum functions at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable this option to test the checksum functions at boot.
KUnit tests run during boot and output the results to the debug log
in TAP format (http://testanything.org/). Only useful for kernel devs
running the KUnit test harness, and not intended for inclusion into a
production build.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config HASH_KUNIT_TEST
tristate "KUnit Test for integer hash functions" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable this option to test the kernel's string (<linux/stringhash.h>), and
integer (<linux/hash.h>) hash functions on boot.
KUnit tests run during boot and output the results to the debug log
in TAP format (https://testanything.org/). Only useful for kernel devs
running the KUnit test harness, and not intended for inclusion into a
production build.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
This is intended to help people writing architecture-specific
optimized versions. If unsure, say N.
config RESOURCE_KUNIT_TEST
tristate "KUnit test for resource API" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the resource API unit test.
Tests the logic of API provided by resource.c and ioport.h.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config SYSCTL_KUNIT_TEST
tristate "KUnit test for sysctl" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the proc sysctl unit test, which runs on boot.
Tests the API contract and implementation correctness of sysctl.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config LIST_KUNIT_TEST
tristate "KUnit Test for Kernel Linked-list structures" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the linked list KUnit test suite.
It tests that the API and basic functionality of the list_head type
and associated macros.
KUnit tests run during boot and output the results to the debug log
in TAP format (https://testanything.org/). Only useful for kernel devs
running the KUnit test harness, and not intended for inclusion into a
production build.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config HASHTABLE_KUNIT_TEST
tristate "KUnit Test for Kernel Hashtable structures" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the hashtable KUnit test suite.
It tests the basic functionality of the API defined in
include/linux/hashtable.h. For more information on KUnit and
unit tests in general please refer to the KUnit documentation
in Documentation/dev-tools/kunit/.
If unsure, say N.
config LINEAR_RANGES_TEST
tristate "KUnit test for linear_ranges"
depends on KUNIT
select LINEAR_RANGES
help
This builds the linear_ranges unit test, which runs on boot.
Tests the linear_ranges logic correctness.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config CMDLINE_KUNIT_TEST
tristate "KUnit test for cmdline API" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the cmdline API unit test.
Tests the logic of API provided by cmdline.c.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config BITS_TEST
tristate "KUnit test for bits.h" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the bits unit test.
Tests the logic of macros defined in bits.h.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
mm/slub, kunit: add a KUnit test for SLUB debugging functionality SLUB has resiliency_test() function which is hidden behind #ifdef SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it. KUnit should be a proper replacement for it. Try changing byte in redzone after allocation and changing pointer to next free node, first byte, 50th byte and redzone byte. Check if validation finds errors. There are several differences from the original resiliency test: Tests create own caches with known state instead of corrupting shared kmalloc caches. The corruption of freepointer uses correct offset, the original resiliency test got broken with freepointer changes. Scratch changing random byte test, because it does not have meaning in this form where we need deterministic results. Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig. Tests next_pointer, first_word and clobber_50th_byte do not run with KASAN option on. Because the test deliberately modifies non-allocated objects. Use kunit_resource to count errors in cache and silence bug reports. Count error whenever slab_bug() or slab_fix() is called or when the count of pages is wrong. [glittao@gmail.com: remove unused function test_exit(), from SLUB KUnit test] Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com [akpm@linux-foundation.org: export kasan_enable/disable_current to modules] Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Daniel Latypov <dlatypov@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:34:33 +08:00
config SLUB_KUNIT_TEST
tristate "KUnit test for SLUB cache error detection" if !KUNIT_ALL_TESTS
depends on SLUB_DEBUG && KUNIT
default KUNIT_ALL_TESTS
help
This builds SLUB allocator unit test.
Tests SLUB cache debugging functionality.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config RATIONAL_KUNIT_TEST
tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS
depends on KUNIT && RATIONAL
default KUNIT_ALL_TESTS
help
This builds the rational math unit test.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config MEMCPY_KUNIT_TEST
tristate "Test memcpy(), memmove(), and memset() functions at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Builds unit tests for memcpy(), memmove(), and memset() functions.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
config IS_SIGNED_TYPE_KUNIT_TEST
tristate "Test is_signed_type() macro" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Builds unit tests for the is_signed_type() macro.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
lib: overflow: Convert to Kunit Convert overflow unit tests to KUnit, for better integration into the kernel self test framework. Includes a rename of test_overflow.c to overflow_kunit.c, and CONFIG_TEST_OVERFLOW to CONFIG_OVERFLOW_KUNIT_TEST. $ ./tools/testing/kunit/kunit.py run overflow ... [14:33:51] Starting KUnit Kernel (1/1)... [14:33:51] ============================================================ [14:33:51] ================== overflow (11 subtests) ================== [14:33:51] [PASSED] u8_overflow_test [14:33:51] [PASSED] s8_overflow_test [14:33:51] [PASSED] u16_overflow_test [14:33:51] [PASSED] s16_overflow_test [14:33:51] [PASSED] u32_overflow_test [14:33:51] [PASSED] s32_overflow_test [14:33:51] [PASSED] u64_overflow_test [14:33:51] [PASSED] s64_overflow_test [14:33:51] [PASSED] overflow_shift_test [14:33:51] [PASSED] overflow_allocation_test [14:33:51] [PASSED] overflow_size_helpers_test [14:33:51] ==================== [PASSED] overflow ===================== [14:33:51] ============================================================ [14:33:51] Testing complete. Passed: 11, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0 [14:33:51] Elapsed time: 12.525s total, 0.001s configuring, 12.402s building, 0.101s running Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Nick Desaulniers <ndesaulniers@google.com> Co-developed-by: Vitor Massaru Iha <vitor@massaru.org> Signed-off-by: Vitor Massaru Iha <vitor@massaru.org> Link: https://lore.kernel.org/lkml/20200720224418.200495-1-vitor@massaru.org/ Co-developed-by: Daniel Latypov <dlatypov@google.com> Signed-off-by: Daniel Latypov <dlatypov@google.com> Link: https://lore.kernel.org/linux-kselftest/20210503211536.1384578-1-dlatypov@google.com/ Acked-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/lkml/CAKwvOdm62iA1dNiC6Q11UJ-MnTqtc4kXkm-ubPaFMK824_k0nw@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/lkml/CABVgOS=TWVh649_Vjo3wnMu9gZnq66gkV-LtGgsksAWMqc+MSA@mail.gmail.com
2022-02-17 06:17:49 +08:00
config OVERFLOW_KUNIT_TEST
tristate "Test check_*_overflow() functions at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Builds unit tests for the check_*_overflow(), size_*(), allocation, and
related functions.
For more information on KUnit and unit tests in general please refer
to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
lib: stackinit: Convert to KUnit Convert stackinit unit tests to KUnit, for better integration into the kernel self test framework. Includes a rename of test_stackinit.c to stackinit_kunit.c, and CONFIG_TEST_STACKINIT to CONFIG_STACKINIT_KUNIT_TEST. Adjust expected test results based on which stack initialization method was chosen: $ CMD="./tools/testing/kunit/kunit.py run stackinit --raw_output \ --arch=x86_64 --kconfig_add" $ $CMD | grep stackinit: # stackinit: pass:36 fail:0 skip:29 total:65 $ $CMD CONFIG_GCC_PLUGIN_STRUCTLEAK_USER=y | grep stackinit: # stackinit: pass:37 fail:0 skip:28 total:65 $ $CMD CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y | grep stackinit: # stackinit: pass:55 fail:0 skip:10 total:65 $ $CMD CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y | grep stackinit: # stackinit: pass:62 fail:0 skip:3 total:65 $ $CMD CONFIG_INIT_STACK_ALL_PATTERN=y --make_option LLVM=1 | grep stackinit: # stackinit: pass:60 fail:0 skip:5 total:65 $ $CMD CONFIG_INIT_STACK_ALL_ZERO=y --make_option LLVM=1 | grep stackinit: # stackinit: pass:60 fail:0 skip:5 total:65 Temporarily remove the userspace-build mode, which will be restored in a later patch. Expand the size of the pre-case switch variable so it doesn't get accidentally cleared. Cc: David Gow <davidgow@google.com> Cc: Daniel Latypov <dlatypov@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kees Cook <keescook@chromium.org> --- v1: https://lore.kernel.org/lkml/20220224055145.1853657-1-keescook@chromium.org v2: - split "userspace KUnit stub" into separate header and patch (Daniel) - Improve commit log and comments (David) - Provide mapping of expected XFAIL tests to CONFIGs (David)
2022-02-17 08:03:41 +08:00
config STACKINIT_KUNIT_TEST
tristate "Test level of stack variable initialization" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Test if the kernel is zero-initializing stack variables and
padding. Coverage is controlled by compiler flags,
CONFIG_INIT_STACK_ALL_PATTERN, CONFIG_INIT_STACK_ALL_ZERO,
CONFIG_GCC_PLUGIN_STRUCTLEAK, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF,
or CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL.
config FORTIFY_KUNIT_TEST
tristate "Test fortified str*() and mem*() function internals at runtime" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Builds unit tests for checking internals of FORTIFY_SOURCE as used
by the str*() and mem*() family of functions. For testing runtime
traps of FORTIFY_SOURCE, see LKDTM's "FORTIFY_*" tests.
config HW_BREAKPOINT_KUNIT_TEST
bool "Test hw_breakpoint constraints accounting" if !KUNIT_ALL_TESTS
depends on HAVE_HW_BREAKPOINT
depends on KUNIT=y
default KUNIT_ALL_TESTS
help
Tests for hw_breakpoint constraints accounting.
If unsure, say N.
config SIPHASH_KUNIT_TEST
tristate "Perform selftest on siphash functions" if !KUNIT_ALL_TESTS
depends on KUNIT
default KUNIT_ALL_TESTS
help
Enable this option to test the kernel's siphash (<linux/siphash.h>) hash
functions on boot (or module load).
This is intended to help people writing architecture-specific
optimized versions. If unsure, say N.
config USERCOPY_KUNIT_TEST
tristate "KUnit Test for user/kernel boundary protections"
depends on KUNIT
default KUNIT_ALL_TESTS
help
This builds the "usercopy_kunit" module that runs sanity checks
on the copy_to/from_user infrastructure, making sure basic
user/kernel boundary testing is working.
config TEST_UDELAY
tristate "udelay test driver"
help
This builds the "udelay_test" module that helps to make sure
that udelay() is working properly.
If unsure, say N.
config TEST_STATIC_KEYS
tristate "Test static keys"
depends on m
help
Test the static key interfaces.
If unsure, say N.
config TEST_DYNAMIC_DEBUG
tristate "Test DYNAMIC_DEBUG"
depends on DYNAMIC_DEBUG
help
This module registers a tracer callback to count enabled
pr_debugs in a 'do_debugging' function, then alters their
enablements, calls the function, and compares counts.
If unsure, say N.
kmod: add test driver to stress test the module loader This adds a new stress test driver for kmod: the kernel module loader. The new stress test driver, test_kmod, is only enabled as a module right now. It should be possible to load this as built-in and load tests early (refer to the force_init_test module parameter), however since a lot of test can get a system out of memory fast we leave this disabled for now. Using a system with 1024 MiB of RAM can *easily* get your kernel OOM fast with this test driver. The test_kmod driver exposes API knobs for us to fine tune simple request_module() and get_fs_type() calls. Since these API calls only allow each one parameter a test driver for these is rather simple. Other factors that can help out test driver though are the number of calls we issue and knowing current limitations of each. This exposes configuration as much as possible through userspace to be able to build tests directly from userspace. Since it allows multiple misc devices its will eventually (once we add a knob to let us create new devices at will) also be possible to perform more tests in parallel, provided you have enough memory. We only enable tests we know work as of right now. Demo screenshots: # tools/testing/selftests/kmod/kmod.sh kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0002_driver: OK! - loading kmod test kmod_test_0002_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0002_fs: OK! - loading kmod test kmod_test_0002_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0003: OK! - loading kmod test kmod_test_0003: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0004: OK! - loading kmod test kmod_test_0004: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS XXX: add test restult for 0007 Test completed You can also request for specific tests: # tools/testing/selftests/kmod/kmod.sh -t 0001 kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL Test completed Lastly, the current available number of tests: # tools/testing/selftests/kmod/kmod.sh --help Usage: tools/testing/selftests/kmod/kmod.sh [ -t <4-number-digit> ] Valid tests: 0001-0009 0001 - Simple test - 1 thread for empty string 0002 - Simple test - 1 thread for modules/filesystems that do not exist 0003 - Simple test - 1 thread for get_fs_type() only 0004 - Simple test - 2 threads for get_fs_type() only 0005 - multithreaded tests with default setup - request_module() only 0006 - multithreaded tests with default setup - get_fs_type() only 0007 - multithreaded tests with default setup test request_module() and get_fs_type() 0008 - multithreaded - push kmod_concurrent over max_modprobes for request_module() 0009 - multithreaded - push kmod_concurrent over max_modprobes for get_fs_type() The following test cases currently fail, as such they are not currently enabled by default: # tools/testing/selftests/kmod/kmod.sh -t 0008 # tools/testing/selftests/kmod/kmod.sh -t 0009 To be sure to run them as intended please unload both of the modules: o test_module o xfs And ensure they are not loaded on your system prior to testing them. If you use these paritions for your rootfs you can change the default test driver used for get_fs_type() by exporting it into your environment. For example of other test defaults you can override refer to kmod.sh allow_user_defaults(). Behind the scenes this is how we fine tune at a test case prior to hitting a trigger to run it: cat /sys/devices/virtual/misc/test_kmod0/config echo -n "2" > /sys/devices/virtual/misc/test_kmod0/config_test_case echo -n "ext4" > /sys/devices/virtual/misc/test_kmod0/config_test_fs echo -n "80" > /sys/devices/virtual/misc/test_kmod0/config_num_threads cat /sys/devices/virtual/misc/test_kmod0/config echo -n "1" > /sys/devices/virtual/misc/test_kmod0/config_num_threads Finally to trigger: echo -n "1" > /sys/devices/virtual/misc/test_kmod0/trigger_config The kmod.sh script uses the above constructs to build different test cases. A bit of interpretation of the current failures follows, first two premises: a) When request_module() is used userspace figures out an optimized version of module order for us. Once it finds the modules it needs, as per depmod symbol dep map, it will finit_module() the respective modules which are needed for the original request_module() request. b) We have an optimization in place whereby if a kernel uses request_module() on a module already loaded we never bother userspace as the module already is loaded. This is all handled by kernel/kmod.c. A few things to consider to help identify root causes of issues: 0) kmod 19 has a broken heuristic for modules being assumed to be built-in to your kernel and will return 0 even though request_module() failed. Upgrade to a newer version of kmod. 1) A get_fs_type() call for "xfs" will request_module() for "fs-xfs", not for "xfs". The optimization in kernel described in b) fails to catch if we have a lot of consecutive get_fs_type() calls. The reason is the optimization in place does not look for aliases. This means two consecutive get_fs_type() calls will bump kmod_concurrent, whereas request_module() will not. This one explanation why test case 0009 fails at least once for get_fs_type(). 2) If a module fails to load --- for whatever reason (kmod_concurrent limit reached, file not yet present due to rootfs switch, out of memory) we have a period of time during which module request for the same name either with request_module() or get_fs_type() will *also* fail to load even if the file for the module is ready. This explains why *multiple* NULLs are possible on test 0009. 3) finit_module() consumes quite a bit of memory. 4) Filesystems typically also have more dependent modules than other modules, its important to note though that even though a get_fs_type() call does not incur additional kmod_concurrent bumps, since userspace loads dependencies it finds it needs via finit_module_fd(), it *will* take much more memory to load a module with a lot of dependencies. Because of 3) and 4) we will easily run into out of memory failures with certain tests. For instance test 0006 fails on qemu with 1024 MiB of RAM. It panics a box after reaping all userspace processes and still not having enough memory to reap. [arnd@arndb.de: add dependencies for test module] Link: http://lkml.kernel.org/r/20170630154834.3689272-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170628223155.26472-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-15 05:50:08 +08:00
config TEST_KMOD
tristate "kmod stress tester"
depends on m
depends on NETDEVICES && NET_CORE && INET # for TUN
depends on BLOCK
depends on PAGE_SIZE_LESS_THAN_256KB # for BTRFS
kmod: add test driver to stress test the module loader This adds a new stress test driver for kmod: the kernel module loader. The new stress test driver, test_kmod, is only enabled as a module right now. It should be possible to load this as built-in and load tests early (refer to the force_init_test module parameter), however since a lot of test can get a system out of memory fast we leave this disabled for now. Using a system with 1024 MiB of RAM can *easily* get your kernel OOM fast with this test driver. The test_kmod driver exposes API knobs for us to fine tune simple request_module() and get_fs_type() calls. Since these API calls only allow each one parameter a test driver for these is rather simple. Other factors that can help out test driver though are the number of calls we issue and knowing current limitations of each. This exposes configuration as much as possible through userspace to be able to build tests directly from userspace. Since it allows multiple misc devices its will eventually (once we add a knob to let us create new devices at will) also be possible to perform more tests in parallel, provided you have enough memory. We only enable tests we know work as of right now. Demo screenshots: # tools/testing/selftests/kmod/kmod.sh kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0002_driver: OK! - loading kmod test kmod_test_0002_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0002_fs: OK! - loading kmod test kmod_test_0002_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0003: OK! - loading kmod test kmod_test_0003: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0004: OK! - loading kmod test kmod_test_0004: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS XXX: add test restult for 0007 Test completed You can also request for specific tests: # tools/testing/selftests/kmod/kmod.sh -t 0001 kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL Test completed Lastly, the current available number of tests: # tools/testing/selftests/kmod/kmod.sh --help Usage: tools/testing/selftests/kmod/kmod.sh [ -t <4-number-digit> ] Valid tests: 0001-0009 0001 - Simple test - 1 thread for empty string 0002 - Simple test - 1 thread for modules/filesystems that do not exist 0003 - Simple test - 1 thread for get_fs_type() only 0004 - Simple test - 2 threads for get_fs_type() only 0005 - multithreaded tests with default setup - request_module() only 0006 - multithreaded tests with default setup - get_fs_type() only 0007 - multithreaded tests with default setup test request_module() and get_fs_type() 0008 - multithreaded - push kmod_concurrent over max_modprobes for request_module() 0009 - multithreaded - push kmod_concurrent over max_modprobes for get_fs_type() The following test cases currently fail, as such they are not currently enabled by default: # tools/testing/selftests/kmod/kmod.sh -t 0008 # tools/testing/selftests/kmod/kmod.sh -t 0009 To be sure to run them as intended please unload both of the modules: o test_module o xfs And ensure they are not loaded on your system prior to testing them. If you use these paritions for your rootfs you can change the default test driver used for get_fs_type() by exporting it into your environment. For example of other test defaults you can override refer to kmod.sh allow_user_defaults(). Behind the scenes this is how we fine tune at a test case prior to hitting a trigger to run it: cat /sys/devices/virtual/misc/test_kmod0/config echo -n "2" > /sys/devices/virtual/misc/test_kmod0/config_test_case echo -n "ext4" > /sys/devices/virtual/misc/test_kmod0/config_test_fs echo -n "80" > /sys/devices/virtual/misc/test_kmod0/config_num_threads cat /sys/devices/virtual/misc/test_kmod0/config echo -n "1" > /sys/devices/virtual/misc/test_kmod0/config_num_threads Finally to trigger: echo -n "1" > /sys/devices/virtual/misc/test_kmod0/trigger_config The kmod.sh script uses the above constructs to build different test cases. A bit of interpretation of the current failures follows, first two premises: a) When request_module() is used userspace figures out an optimized version of module order for us. Once it finds the modules it needs, as per depmod symbol dep map, it will finit_module() the respective modules which are needed for the original request_module() request. b) We have an optimization in place whereby if a kernel uses request_module() on a module already loaded we never bother userspace as the module already is loaded. This is all handled by kernel/kmod.c. A few things to consider to help identify root causes of issues: 0) kmod 19 has a broken heuristic for modules being assumed to be built-in to your kernel and will return 0 even though request_module() failed. Upgrade to a newer version of kmod. 1) A get_fs_type() call for "xfs" will request_module() for "fs-xfs", not for "xfs". The optimization in kernel described in b) fails to catch if we have a lot of consecutive get_fs_type() calls. The reason is the optimization in place does not look for aliases. This means two consecutive get_fs_type() calls will bump kmod_concurrent, whereas request_module() will not. This one explanation why test case 0009 fails at least once for get_fs_type(). 2) If a module fails to load --- for whatever reason (kmod_concurrent limit reached, file not yet present due to rootfs switch, out of memory) we have a period of time during which module request for the same name either with request_module() or get_fs_type() will *also* fail to load even if the file for the module is ready. This explains why *multiple* NULLs are possible on test 0009. 3) finit_module() consumes quite a bit of memory. 4) Filesystems typically also have more dependent modules than other modules, its important to note though that even though a get_fs_type() call does not incur additional kmod_concurrent bumps, since userspace loads dependencies it finds it needs via finit_module_fd(), it *will* take much more memory to load a module with a lot of dependencies. Because of 3) and 4) we will easily run into out of memory failures with certain tests. For instance test 0006 fails on qemu with 1024 MiB of RAM. It panics a box after reaping all userspace processes and still not having enough memory to reap. [arnd@arndb.de: add dependencies for test module] Link: http://lkml.kernel.org/r/20170630154834.3689272-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170628223155.26472-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-15 05:50:08 +08:00
select TEST_LKM
select XFS_FS
select TUN
select BTRFS_FS
help
Test the kernel's module loading mechanism: kmod. kmod implements
support to load modules using the Linux kernel's usermode helper.
This test provides a series of tests against kmod.
Although technically you can either build test_kmod as a module or
into the kernel we disallow building it into the kernel since
it stress tests request_module() and this will very likely cause
some issues by taking over precious threads available from other
module load requests, ultimately this could be fatal.
To run tests run:
tools/testing/selftests/kmod/kmod.sh --help
If unsure, say N.
config TEST_DEBUG_VIRTUAL
tristate "Test CONFIG_DEBUG_VIRTUAL feature"
depends on DEBUG_VIRTUAL
help
Test the kernel's ability to detect incorrect calls to
virt_to_phys() done against the non-linear part of the
kernel's virtual address map.
If unsure, say N.
config TEST_MEMCAT_P
tristate "Test memcat_p() helper function"
help
Test the memcat_p() helper for correctly merging two
pointer arrays together.
If unsure, say N.
config TEST_OBJAGG
tristate "Perform selftest on object aggreration manager"
default n
depends on OBJAGG
help
Enable this option to test object aggregation manager on boot
(or module load).
config TEST_MEMINIT
tristate "Test heap/page initialization"
help
Test if the kernel is zero-initializing heap and page allocations.
This can be useful to test init_on_alloc and init_on_free features.
If unsure, say N.
config TEST_HMM
tristate "Test HMM (Heterogeneous Memory Management)"
depends on TRANSPARENT_HUGEPAGE
depends on DEVICE_PRIVATE
select HMM_MIRROR
select MMU_NOTIFIER
help
This is a pseudo device driver solely for testing HMM.
Say M here if you want to build the HMM test module.
Doing so will allow you to run tools/testing/selftest/vm/hmm-tests.
If unsure, say N.
config TEST_FREE_PAGES
tristate "Test freeing pages"
help
Test that a memory leak does not occur due to a race between
freeing a block of pages and a speculative page reference.
Loading this module is safe if your kernel has the bug fixed.
If the bug is not fixed, it will leak gigabytes of memory and
probably OOM your system.
config TEST_FPU
tristate "Test floating point operations in kernel space"
depends on ARCH_HAS_KERNEL_FPU_SUPPORT && !KCOV_INSTRUMENT_ALL
help
Enable this option to add /sys/kernel/debug/selftest_helpers/test_fpu
which will trigger a sequence of floating point operations. This is used
for self-testing floating point control register setting in
kernel_fpu_begin().
If unsure, say N.
clocksource: Provide kernel module to test clocksource watchdog When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. It would be good to have a way of testing the clocksource watchdog's ability to distinguish between these two causes of clock skew and instability. Therefore, provide a new clocksource-wdtest module selected by a new TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module parameter named "holdoff" that provides the number of seconds of delay before testing should start, which defaults to zero when built as a module and to 10 seconds when built directly into the kernel. Very large systems that boot slowly may need to increase the value of this module parameter. This module uses hand-crafted clocksource structures to do its testing, thus avoiding messing up timing for the rest of the kernel and for user applications. This module first verifies that the ->uncertainty_margin field of the clocksource structures are set sanely. It then tests the delay-detection capability of the clocksource watchdog, increasing the number of consecutive delays injected, first provoking console messages complaining about the delays and finally forcing a clock-skew event. Unexpected test results cause at least one WARN_ON_ONCE() console splat. If there are no splats, the test has passed. Finally, it fuzzes the value returned from a clocksource to test the clocksource watchdog's ability to detect time skew. This module checks the state of its clocksource after each test, and uses WARN_ON_ONCE() to emit a console splat if there are any failures. This should enable all types of test frameworks to detect any such failures. This facility is intended for diagnostic use only, and should be avoided on production systems. Reported-by: Chris Mason <clm@fb.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
2021-05-28 03:01:23 +08:00
config TEST_CLOCKSOURCE_WATCHDOG
tristate "Test clocksource watchdog in kernel space"
depends on CLOCKSOURCE_WATCHDOG
help
Enable this option to create a kernel module that will trigger
a test of the clocksource watchdog. This module may be loaded
via modprobe or insmod in which case it will run upon being
loaded, or it may be built in, in which case it will run
shortly after boot.
If unsure, say N.
config TEST_OBJPOOL
tristate "Test module for correctness and stress of objpool"
default n
depends on m && DEBUG_KERNEL
help
This builds the "test_objpool" module that should be used for
correctness verification and concurrent testings of objects
allocation and reclamation.
If unsure, say N.
endif # RUNTIME_TESTING_MENU
config ARCH_USE_MEMTEST
bool
help
An architecture should select this when it uses early_memtest()
during boot process.
config MEMTEST
bool "Memtest"
depends on ARCH_USE_MEMTEST
help
This option adds a kernel parameter 'memtest', which allows memtest
to be set and executed.
memtest=0, mean disabled; -- default
memtest=1, mean do 1 test pattern;
...
memtest=17, mean do 17 test patterns.
If you are unsure how to answer this question, answer N.
config HYPERV_TESTING
bool "Microsoft Hyper-V driver testing"
default n
depends on HYPERV && DEBUG_FS
help
Select this option to enable Hyper-V vmbus testing.
endmenu # "Kernel Testing and Coverage"
Kbuild: add Rust support Having most of the new files in place, we now enable Rust support in the build system, including `Kconfig` entries related to Rust, the Rust configuration printer and a few other bits. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com> Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com> Co-developed-by: Finn Behrens <me@kloenk.de> Signed-off-by: Finn Behrens <me@kloenk.de> Co-developed-by: Adam Bratschi-Kaye <ark.email@gmail.com> Signed-off-by: Adam Bratschi-Kaye <ark.email@gmail.com> Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Co-developed-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Co-developed-by: Sven Van Asbroeck <thesven73@gmail.com> Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com> Co-developed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Gary Guo <gary@garyguo.net> Co-developed-by: Boris-Chengbiao Zhou <bobo1239@web.de> Signed-off-by: Boris-Chengbiao Zhou <bobo1239@web.de> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Co-developed-by: Douglas Su <d0u9.su@outlook.com> Signed-off-by: Douglas Su <d0u9.su@outlook.com> Co-developed-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Signed-off-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Co-developed-by: Antonio Terceiro <antonio.terceiro@linaro.org> Signed-off-by: Antonio Terceiro <antonio.terceiro@linaro.org> Co-developed-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Co-developed-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Signed-off-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Co-developed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2021-07-03 22:42:57 +08:00
menu "Rust hacking"
config RUST_DEBUG_ASSERTIONS
bool "Debug assertions"
depends on RUST
help
Enables rustc's `-Cdebug-assertions` codegen option.
This flag lets you turn `cfg(debug_assertions)` conditional
compilation on or off. This can be used to enable extra debugging
code in development but not in production. For example, it controls
the behavior of the standard library's `debug_assert!` macro.
Note that this will apply to all Rust code, including `core`.
If unsure, say N.
config RUST_OVERFLOW_CHECKS
bool "Overflow checks"
default y
depends on RUST
help
Enables rustc's `-Coverflow-checks` codegen option.
This flag allows you to control the behavior of runtime integer
overflow. When overflow-checks are enabled, a Rust panic will occur
on overflow.
Note that this will apply to all Rust code, including `core`.
If unsure, say Y.
config RUST_BUILD_ASSERT_ALLOW
bool "Allow unoptimized build-time assertions"
depends on RUST
help
Controls how are `build_error!` and `build_assert!` handled during build.
If calls to them exist in the binary, it may indicate a violated invariant
or that the optimizer failed to verify the invariant during compilation.
This should not happen, thus by default the build is aborted. However,
as an escape hatch, you can choose Y here to ignore them during build
and let the check be carried at runtime (with `panic!` being called if
the check fails).
If unsure, say N.
rust: support running Rust documentation tests as KUnit ones Rust has documentation tests: these are typically examples of usage of any item (e.g. function, struct, module...). They are very convenient because they are just written alongside the documentation. For instance: /// Sums two numbers. /// /// ``` /// assert_eq!(mymod::f(10, 20), 30); /// ``` pub fn f(a: i32, b: i32) -> i32 { a + b } In userspace, the tests are collected and run via `rustdoc`. Using the tool as-is would be useful already, since it allows to compile-test most tests (thus enforcing they are kept in sync with the code they document) and run those that do not depend on in-kernel APIs. However, by transforming the tests into a KUnit test suite, they can also be run inside the kernel. Moreover, the tests get to be compiled as other Rust kernel objects instead of targeting userspace. On top of that, the integration with KUnit means the Rust support gets to reuse the existing testing facilities. For instance, the kernel log would look like: KTAP version 1 1..1 KTAP version 1 # Subtest: rust_doctests_kernel 1..59 # rust_doctest_kernel_build_assert_rs_0.location: rust/kernel/build_assert.rs:13 ok 1 rust_doctest_kernel_build_assert_rs_0 # rust_doctest_kernel_build_assert_rs_1.location: rust/kernel/build_assert.rs:56 ok 2 rust_doctest_kernel_build_assert_rs_1 # rust_doctest_kernel_init_rs_0.location: rust/kernel/init.rs:122 ok 3 rust_doctest_kernel_init_rs_0 ... # rust_doctest_kernel_types_rs_2.location: rust/kernel/types.rs:150 ok 59 rust_doctest_kernel_types_rs_2 # rust_doctests_kernel: pass:59 fail:0 skip:0 total:59 # Totals: pass:59 fail:0 skip:0 total:59 ok 1 rust_doctests_kernel Therefore, add support for running Rust documentation tests in KUnit. Some other notes about the current implementation and support follow. The transformation is performed by a couple scripts written as Rust hostprogs. Tests using the `?` operator are also supported as usual, e.g.: /// ``` /// # use kernel::{spawn_work_item, workqueue}; /// spawn_work_item!(workqueue::system(), || pr_info!("x"))?; /// # Ok::<(), Error>(()) /// ``` The tests are also compiled with Clippy under `CLIPPY=1`, just like normal code, thus also benefitting from extra linting. The names of the tests are currently automatically generated. This allows to reduce the burden for documentation writers, while keeping them fairly stable for bisection. This is an improvement over the `rustdoc`-generated names, which include the line number; but ideally we would like to get `rustdoc` to provide the Rust item path and a number (for multiple examples in a single documented Rust item). In order for developers to easily see from which original line a failed doctests came from, a KTAP diagnostic line is printed to the log, containing the location (file and line) of the original test (i.e. instead of the location in the generated Rust file): # rust_doctest_kernel_types_rs_2.location: rust/kernel/types.rs:150 This line follows the syntax for declaring test metadata in the proposed KTAP v2 spec [1], which may be used for the proposed KUnit test attributes API [2]. Thus hopefully this will make migration easier later on (suggested by David [3]). The original line in that test attribute is figured out by providing an anchor (suggested by Boqun [4]). The original file is found by walking the filesystem, checking directory prefixes to reduce the amount of combinations to check, and it is only done once per file. Ambiguities are detected and reported. A notable difference from KUnit C tests is that the Rust tests appear to assert using the usual `assert!` and `assert_eq!` macros from the Rust standard library (`core`). We provide a custom version that forwards the call to KUnit instead. Importantly, these macros do not require passing context, unlike the KUnit C ones (i.e. `struct kunit *`). This makes them easier to use, and readers of the documentation do not need to care about which testing framework is used. In addition, it may allow us to test third-party code more easily in the future. However, a current limitation is that KUnit does not support assertions in other tasks. Thus we presently simply print an error to the kernel log if an assertion actually failed. This should be revisited to properly fail the test, perhaps saving the context somewhere else, or letting KUnit handle it. Link: https://lore.kernel.org/lkml/20230420205734.1288498-1-rmoar@google.com/ [1] Link: https://lore.kernel.org/linux-kselftest/20230707210947.1208717-1-rmoar@google.com/ [2] Link: https://lore.kernel.org/rust-for-linux/CABVgOSkOLO-8v6kdAGpmYnZUb+LKOX0CtYCo-Bge7r_2YTuXDQ@mail.gmail.com/ [3] Link: https://lore.kernel.org/rust-for-linux/ZIps86MbJF%2FiGIzd@boqun-archlinux/ [4] Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-07-18 13:27:51 +08:00
config RUST_KERNEL_DOCTESTS
bool "Doctests for the `kernel` crate" if !KUNIT_ALL_TESTS
depends on RUST && KUNIT=y
default KUNIT_ALL_TESTS
help
This builds the documentation tests of the `kernel` crate
as KUnit tests.
For more information on KUnit and unit tests in general,
please refer to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
Kbuild: add Rust support Having most of the new files in place, we now enable Rust support in the build system, including `Kconfig` entries related to Rust, the Rust configuration printer and a few other bits. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com> Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com> Co-developed-by: Finn Behrens <me@kloenk.de> Signed-off-by: Finn Behrens <me@kloenk.de> Co-developed-by: Adam Bratschi-Kaye <ark.email@gmail.com> Signed-off-by: Adam Bratschi-Kaye <ark.email@gmail.com> Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Co-developed-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Co-developed-by: Sven Van Asbroeck <thesven73@gmail.com> Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com> Co-developed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Gary Guo <gary@garyguo.net> Co-developed-by: Boris-Chengbiao Zhou <bobo1239@web.de> Signed-off-by: Boris-Chengbiao Zhou <bobo1239@web.de> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Co-developed-by: Douglas Su <d0u9.su@outlook.com> Signed-off-by: Douglas Su <d0u9.su@outlook.com> Co-developed-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Signed-off-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Co-developed-by: Antonio Terceiro <antonio.terceiro@linaro.org> Signed-off-by: Antonio Terceiro <antonio.terceiro@linaro.org> Co-developed-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Co-developed-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Signed-off-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Co-developed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2021-07-03 22:42:57 +08:00
endmenu # "Rust"
endmenu # Kernel hacking