2006-09-29 16:59:00 +08:00
|
|
|
/*
|
|
|
|
* Copyright 2006, Red Hat, Inc., Dave Jones
|
|
|
|
* Released under the General Public License (GPL).
|
|
|
|
*
|
list: Introduce CONFIG_LIST_HARDENED
Numerous production kernel configs (see [1, 2]) are choosing to enable
CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened
configs [3]. The motivation behind this is that the option can be used
as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025
are mitigated by the option [4]).
The feature has never been designed with performance in mind, yet common
list manipulation is happening across hot paths all over the kernel.
Introduce CONFIG_LIST_HARDENED, which performs list pointer checking
inline, and only upon list corruption calls the reporting slow path.
To generate optimal machine code with CONFIG_LIST_HARDENED:
1. Elide checking for pointer values which upon dereference would
result in an immediate access fault (i.e. minimal hardening
checks). The trade-off is lower-quality error reports.
2. Use the __preserve_most function attribute (available with Clang,
but not yet with GCC) to minimize the code footprint for calling
the reporting slow path. As a result, function size of callers is
reduced by avoiding saving registers before calling the rarely
called reporting slow path.
Note that all TUs in lib/Makefile already disable function tracing,
including list_debug.c, and __preserve_most's implied notrace has
no effect in this case.
3. Because the inline checks are a subset of the full set of checks in
__list_*_valid_or_report(), always return false if the inline
checks failed. This avoids redundant compare and conditional
branch right after return from the slow path.
As a side-effect of the checks being inline, if the compiler can prove
some condition to always be true, it can completely elide some checks.
Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the
Kconfig variables are changed to reflect that: DEBUG_LIST selects
LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on
DEBUG_LIST.
Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with
"preserve_most") shows throughput improvements, in my case of ~7% on
average (up to 20-30% on some test cases).
Link: https://r.android.com/1266735 [1]
Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2]
Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3]
Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4]
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-11 23:18:40 +08:00
|
|
|
* This file contains the linked list validation and error reporting for
|
|
|
|
* LIST_HARDENED and DEBUG_LIST.
|
2006-09-29 16:59:00 +08:00
|
|
|
*/
|
|
|
|
|
2011-11-17 10:29:17 +08:00
|
|
|
#include <linux/export.h>
|
2006-09-29 16:59:00 +08:00
|
|
|
#include <linux/list.h>
|
2012-01-21 07:35:53 +08:00
|
|
|
#include <linux/bug.h>
|
2012-01-21 07:46:49 +08:00
|
|
|
#include <linux/kernel.h>
|
2012-03-15 10:17:39 +08:00
|
|
|
#include <linux/rculist.h>
|
2006-09-29 16:59:00 +08:00
|
|
|
|
|
|
|
/*
|
2016-08-18 05:42:08 +08:00
|
|
|
* Check that the data structures for the list manipulations are reasonably
|
|
|
|
* valid. Failures here indicate memory corruption (and possibly an exploit
|
|
|
|
* attempt).
|
2006-09-29 16:59:00 +08:00
|
|
|
*/
|
|
|
|
|
list: Introduce CONFIG_LIST_HARDENED
Numerous production kernel configs (see [1, 2]) are choosing to enable
CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened
configs [3]. The motivation behind this is that the option can be used
as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025
are mitigated by the option [4]).
The feature has never been designed with performance in mind, yet common
list manipulation is happening across hot paths all over the kernel.
Introduce CONFIG_LIST_HARDENED, which performs list pointer checking
inline, and only upon list corruption calls the reporting slow path.
To generate optimal machine code with CONFIG_LIST_HARDENED:
1. Elide checking for pointer values which upon dereference would
result in an immediate access fault (i.e. minimal hardening
checks). The trade-off is lower-quality error reports.
2. Use the __preserve_most function attribute (available with Clang,
but not yet with GCC) to minimize the code footprint for calling
the reporting slow path. As a result, function size of callers is
reduced by avoiding saving registers before calling the rarely
called reporting slow path.
Note that all TUs in lib/Makefile already disable function tracing,
including list_debug.c, and __preserve_most's implied notrace has
no effect in this case.
3. Because the inline checks are a subset of the full set of checks in
__list_*_valid_or_report(), always return false if the inline
checks failed. This avoids redundant compare and conditional
branch right after return from the slow path.
As a side-effect of the checks being inline, if the compiler can prove
some condition to always be true, it can completely elide some checks.
Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the
Kconfig variables are changed to reflect that: DEBUG_LIST selects
LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on
DEBUG_LIST.
Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with
"preserve_most") shows throughput improvements, in my case of ~7% on
average (up to 20-30% on some test cases).
Link: https://r.android.com/1266735 [1]
Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2]
Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3]
Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4]
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-11 23:18:40 +08:00
|
|
|
__list_valid_slowpath
|
2023-08-11 23:18:39 +08:00
|
|
|
bool __list_add_valid_or_report(struct list_head *new, struct list_head *prev,
|
|
|
|
struct list_head *next)
|
2006-09-29 16:59:00 +08:00
|
|
|
{
|
2022-06-01 06:29:51 +08:00
|
|
|
if (CHECK_DATA_CORRUPTION(prev == NULL,
|
|
|
|
"list_add corruption. prev is NULL.\n") ||
|
|
|
|
CHECK_DATA_CORRUPTION(next == NULL,
|
|
|
|
"list_add corruption. next is NULL.\n") ||
|
|
|
|
CHECK_DATA_CORRUPTION(next->prev != prev,
|
2018-04-11 07:33:06 +08:00
|
|
|
"list_add corruption. next->prev should be prev (%px), but was %px. (next=%px).\n",
|
2017-02-25 07:00:38 +08:00
|
|
|
prev, next->prev, next) ||
|
|
|
|
CHECK_DATA_CORRUPTION(prev->next != next,
|
2018-04-11 07:33:06 +08:00
|
|
|
"list_add corruption. prev->next should be next (%px), but was %px. (prev=%px).\n",
|
2017-02-25 07:00:38 +08:00
|
|
|
next, prev->next, prev) ||
|
|
|
|
CHECK_DATA_CORRUPTION(new == prev || new == next,
|
2018-04-11 07:33:06 +08:00
|
|
|
"list_add double add: new=%px, prev=%px, next=%px.\n",
|
2017-02-25 07:00:38 +08:00
|
|
|
new, prev, next))
|
|
|
|
return false;
|
2016-08-18 05:42:11 +08:00
|
|
|
|
2016-08-18 05:42:08 +08:00
|
|
|
return true;
|
2006-09-29 16:59:00 +08:00
|
|
|
}
|
2023-08-11 23:18:39 +08:00
|
|
|
EXPORT_SYMBOL(__list_add_valid_or_report);
|
2006-09-29 16:59:00 +08:00
|
|
|
|
list: Introduce CONFIG_LIST_HARDENED
Numerous production kernel configs (see [1, 2]) are choosing to enable
CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened
configs [3]. The motivation behind this is that the option can be used
as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025
are mitigated by the option [4]).
The feature has never been designed with performance in mind, yet common
list manipulation is happening across hot paths all over the kernel.
Introduce CONFIG_LIST_HARDENED, which performs list pointer checking
inline, and only upon list corruption calls the reporting slow path.
To generate optimal machine code with CONFIG_LIST_HARDENED:
1. Elide checking for pointer values which upon dereference would
result in an immediate access fault (i.e. minimal hardening
checks). The trade-off is lower-quality error reports.
2. Use the __preserve_most function attribute (available with Clang,
but not yet with GCC) to minimize the code footprint for calling
the reporting slow path. As a result, function size of callers is
reduced by avoiding saving registers before calling the rarely
called reporting slow path.
Note that all TUs in lib/Makefile already disable function tracing,
including list_debug.c, and __preserve_most's implied notrace has
no effect in this case.
3. Because the inline checks are a subset of the full set of checks in
__list_*_valid_or_report(), always return false if the inline
checks failed. This avoids redundant compare and conditional
branch right after return from the slow path.
As a side-effect of the checks being inline, if the compiler can prove
some condition to always be true, it can completely elide some checks.
Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the
Kconfig variables are changed to reflect that: DEBUG_LIST selects
LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on
DEBUG_LIST.
Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with
"preserve_most") shows throughput improvements, in my case of ~7% on
average (up to 20-30% on some test cases).
Link: https://r.android.com/1266735 [1]
Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2]
Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3]
Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4]
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-11 23:18:40 +08:00
|
|
|
__list_valid_slowpath
|
2023-08-11 23:18:39 +08:00
|
|
|
bool __list_del_entry_valid_or_report(struct list_head *entry)
|
2011-02-19 03:32:28 +08:00
|
|
|
{
|
|
|
|
struct list_head *prev, *next;
|
|
|
|
|
|
|
|
prev = entry->prev;
|
|
|
|
next = entry->next;
|
|
|
|
|
2022-06-01 06:29:51 +08:00
|
|
|
if (CHECK_DATA_CORRUPTION(next == NULL,
|
|
|
|
"list_del corruption, %px->next is NULL\n", entry) ||
|
|
|
|
CHECK_DATA_CORRUPTION(prev == NULL,
|
|
|
|
"list_del corruption, %px->prev is NULL\n", entry) ||
|
|
|
|
CHECK_DATA_CORRUPTION(next == LIST_POISON1,
|
2018-04-11 07:33:06 +08:00
|
|
|
"list_del corruption, %px->next is LIST_POISON1 (%px)\n",
|
2017-02-25 07:00:38 +08:00
|
|
|
entry, LIST_POISON1) ||
|
|
|
|
CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
|
2018-04-11 07:33:06 +08:00
|
|
|
"list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
|
2017-02-25 07:00:38 +08:00
|
|
|
entry, LIST_POISON2) ||
|
|
|
|
CHECK_DATA_CORRUPTION(prev->next != entry,
|
lib/list_debug.c: print more list debugging context in __list_del_entry_valid()
Currently, the entry->prev and entry->next are considered to be valid as
long as they are not LIST_POISON{1|2}. However, the memory may be
corrupted. The prev->next is invalid probably because 'prev' is
invalid, not because prev->next's content is illegal.
Unfortunately, the printk and its subfunctions will modify the registers
that hold the 'prev' and 'next', and we don't see this valuable
information in the BUG context.
So print the contents of 'entry->prev' and 'entry->next'.
Here's an example:
list_del corruption. prev->next should be c0ecbf74, but was c08410dc
kernel BUG at lib/list_debug.c:53!
... ...
PC is at __list_del_entry_valid+0x58/0x98
LR is at __list_del_entry_valid+0x58/0x98
psr: 60000093
sp : c0ecbf30 ip : 00000000 fp : 00000001
r10: c08410d0 r9 : 00000001 r8 : c0825e0c
r7 : 20000013 r6 : c08410d0 r5 : c0ecbf74 r4 : c0ecbf74
r3 : c0825d08 r2 : 00000000 r1 : df7ce6f4 r0 : 00000044
... ...
Stack: (0xc0ecbf30 to 0xc0ecc000)
bf20: c0ecbf74 c0164fd0 c0ecbf70 c0165170
bf40: c0eca000 c0840c00 c0840c00 c0824500 c0825e0c c0189bbc c088f404 60000013
bf60: 60000013 c0e85100 000004ec 00000000 c0ebcdc0 c0ecbf74 c0ecbf74 c0825d08
bf80: c0e807c0 c018965c 00000000 c013f2a0 c0e807c0 c013f154 00000000 00000000
bfa0: 00000000 00000000 00000000 c01001b0 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
(__list_del_entry_valid) from (__list_del_entry+0xc/0x20)
(__list_del_entry) from (finish_swait+0x60/0x7c)
(finish_swait) from (rcu_gp_kthread+0x560/0xa20)
(rcu_gp_kthread) from (kthread+0x14c/0x15c)
(kthread) from (ret_from_fork+0x14/0x24)
At first, I thought prev->next was overwritten. Later, I carefully
analyzed the RCU code and the disassembly code. The error occurred when
deleting a node from the list rcu_state.gp_wq. The System.map shows
that the address of rcu_state is c0840c00. Then I use gdb to obtain the
offset of rcu_state.gp_wq.task_list.
(gdb) p &((struct rcu_state *)0)->gp_wq.task_list
$1 = (struct list_head *) 0x4dc
Again:
list_del corruption. prev->next should be c0ecbf74, but was c08410dc
c08410dc = c0840c00 + 0x4dc = &rcu_state.gp_wq.task_list
Because rcu_state.gp_wq has at most one node, so I can guess that "prev
= &rcu_state.gp_wq.task_list". But for other scenes, maybe I wasn't so
lucky, I cannot figure out the value of 'prev'.
Link: https://lkml.kernel.org/r/20211207025835.1909-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20 10:08:59 +08:00
|
|
|
"list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
|
|
|
|
entry, prev->next, prev) ||
|
2017-02-25 07:00:38 +08:00
|
|
|
CHECK_DATA_CORRUPTION(next->prev != entry,
|
lib/list_debug.c: print more list debugging context in __list_del_entry_valid()
Currently, the entry->prev and entry->next are considered to be valid as
long as they are not LIST_POISON{1|2}. However, the memory may be
corrupted. The prev->next is invalid probably because 'prev' is
invalid, not because prev->next's content is illegal.
Unfortunately, the printk and its subfunctions will modify the registers
that hold the 'prev' and 'next', and we don't see this valuable
information in the BUG context.
So print the contents of 'entry->prev' and 'entry->next'.
Here's an example:
list_del corruption. prev->next should be c0ecbf74, but was c08410dc
kernel BUG at lib/list_debug.c:53!
... ...
PC is at __list_del_entry_valid+0x58/0x98
LR is at __list_del_entry_valid+0x58/0x98
psr: 60000093
sp : c0ecbf30 ip : 00000000 fp : 00000001
r10: c08410d0 r9 : 00000001 r8 : c0825e0c
r7 : 20000013 r6 : c08410d0 r5 : c0ecbf74 r4 : c0ecbf74
r3 : c0825d08 r2 : 00000000 r1 : df7ce6f4 r0 : 00000044
... ...
Stack: (0xc0ecbf30 to 0xc0ecc000)
bf20: c0ecbf74 c0164fd0 c0ecbf70 c0165170
bf40: c0eca000 c0840c00 c0840c00 c0824500 c0825e0c c0189bbc c088f404 60000013
bf60: 60000013 c0e85100 000004ec 00000000 c0ebcdc0 c0ecbf74 c0ecbf74 c0825d08
bf80: c0e807c0 c018965c 00000000 c013f2a0 c0e807c0 c013f154 00000000 00000000
bfa0: 00000000 00000000 00000000 c01001b0 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
(__list_del_entry_valid) from (__list_del_entry+0xc/0x20)
(__list_del_entry) from (finish_swait+0x60/0x7c)
(finish_swait) from (rcu_gp_kthread+0x560/0xa20)
(rcu_gp_kthread) from (kthread+0x14c/0x15c)
(kthread) from (ret_from_fork+0x14/0x24)
At first, I thought prev->next was overwritten. Later, I carefully
analyzed the RCU code and the disassembly code. The error occurred when
deleting a node from the list rcu_state.gp_wq. The System.map shows
that the address of rcu_state is c0840c00. Then I use gdb to obtain the
offset of rcu_state.gp_wq.task_list.
(gdb) p &((struct rcu_state *)0)->gp_wq.task_list
$1 = (struct list_head *) 0x4dc
Again:
list_del corruption. prev->next should be c0ecbf74, but was c08410dc
c08410dc = c0840c00 + 0x4dc = &rcu_state.gp_wq.task_list
Because rcu_state.gp_wq has at most one node, so I can guess that "prev
= &rcu_state.gp_wq.task_list". But for other scenes, maybe I wasn't so
lucky, I cannot figure out the value of 'prev'.
Link: https://lkml.kernel.org/r/20211207025835.1909-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20 10:08:59 +08:00
|
|
|
"list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
|
|
|
|
entry, next->prev, next))
|
2017-02-25 07:00:38 +08:00
|
|
|
return false;
|
|
|
|
|
2016-08-18 05:42:10 +08:00
|
|
|
return true;
|
2006-09-29 16:59:00 +08:00
|
|
|
}
|
2023-08-11 23:18:39 +08:00
|
|
|
EXPORT_SYMBOL(__list_del_entry_valid_or_report);
|