# SPDX-License-Identifier: GPL-2.0-only
#
# RCU-related configuration options
#

menu "RCU Subsystem"

config TREE_RCU
	bool
	default y if SMP
	# Dynticks-idle tracking
	select CONTEXT_TRACKING_IDLE
	help
	  This option selects the RCU implementation that is
	  designed for very large SMP systems with hundreds or
	  thousands of CPUs. It also scales down nicely to
	  smaller systems.

config PREEMPT_RCU
	bool
	default y if PREEMPTION
	select TREE_RCU
	help
	  This option selects the RCU implementation that is
	  designed for very large SMP systems with hundreds or
	  thousands of CPUs, but for which real-time response
	  is also required. It also scales down nicely to
	  smaller systems.

	  Select this option if you are unsure.

config TINY_RCU
	bool
	default y if !PREEMPT_RCU && !SMP
	help
	  This option selects the RCU implementation that is
	  designed for UP systems from which real-time response
	  is not required. This option greatly reduces the
	  memory footprint of RCU.

config RCU_EXPERT
	bool "Make expert-level adjustments to RCU configuration"
	default n
	help
	  This option needs to be enabled if you wish to make
	  expert-level adjustments to RCU configuration. By default,
	  no such adjustments can be made, which has the often-beneficial
	  side-effect of preventing "make oldconfig" from asking you all
	  sorts of detailed questions about how you would like numerous
	  obscure RCU options to be set up.

	  Say Y if you need to make expert-level adjustments to RCU.

	  Say N if you are unsure.

config TINY_SRCU
	bool
	default y if TINY_RCU
	help
	  This option selects the single-CPU non-preemptible version of SRCU.

config TREE_SRCU
	bool
	default y if !TINY_RCU
	help
	  This option selects the full-fledged version of SRCU.

srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()
On strict load-store architectures, the use of this_cpu_inc() by
srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU.
To see this, suppose that an NMI arrives in the middle of srcu_read_lock(),
just after it has read ->srcu_lock_count, but before it has written
the incremented value back to memory. If that NMI handler also does
srcu_read_lock() and srcu_read_unlock() on that same srcu_struct structure,
then upon return from that NMI handler, the interrupted srcu_read_lock()
will overwrite the NMI handler's update to ->srcu_lock_count, but
leave unchanged the NMI handler's update by srcu_read_unlock() to
->srcu_unlock_count.
This can result in a too-short SRCU grace period, which can in turn
result in arbitrary memory corruption.
If the NMI handler instead interrupts the srcu_read_unlock(), this
can result in eternal SRCU grace periods, which is not much better.
This commit therefore creates a pair of new srcu_read_lock_nmisafe()
and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in
both NMI handlers and in process and IRQ context. It is bad practice
to mix the existing and the new _nmisafe() primitives on the same
srcu_struct structure. Use one set or the other, not both.
Just to underline that "bad practice" point, using srcu_read_lock() at
process level and srcu_read_lock_nmisafe() in your NMI handler will not,
repeat NOT, work. If you do not immediately understand why this is the
case, please review the earlier paragraphs in this commit log.
[ paulmck: Apply kernel test robot feedback. ]
[ paulmck: Apply feedback from Randy Dunlap. ]
[ paulmck: Apply feedback from John Ogness. ]
[ paulmck: Apply feedback from Frederic Weisbecker. ]
Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
2022-09-16 05:29:07 +08:00
config NEED_SRCU_NMI_SAFE
	def_bool HAVE_NMI && !ARCH_HAS_NMI_SAFE_THIS_CPU_OPS && !TINY_SRCU

config TASKS_RCU_GENERIC
rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated. This commit therefore adds a variant of
Tasks RCU that:
o Has explicit read-side markers to allow finite grace periods in
the face of in-kernel loops for PREEMPT=n builds. These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace().
o Protects code in the idle loop, exception entry/exit, and
CPU-hotplug code paths. In this respect, RCU-tasks trace is
similar to SRCU, but with lighter-weight readers.
o Avoids expensive read-side instructions, having overhead similar
to that of Preemptible RCU.
There are of course downsides:
o The grace-period code can send IPIs to CPUs, even when those
CPUs are in the idle loop or in nohz_full userspace. This is
mitigated by later commits.
o It is necessary to scan the full tasklist, much as for Tasks RCU.
o There is a single callback queue guarded by a single lock,
again, much as for Tasks RCU. However, those early use cases
that request multiple grace periods in quick succession are
expected to do so from a single task, which makes the single
lock almost irrelevant. If needed, multiple callback queues
can be provided using any number of schemes.
Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.
The memory ordering was outlined here:
https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko. At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.
Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
[ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
[ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
[ paulmck: Fix locking issue reported by rcutorture. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-10 10:56:53 +08:00
	def_bool TASKS_RCU || TASKS_RUDE_RCU || TASKS_TRACE_RCU
	help
	  This option enables generic infrastructure code supporting
	  task-based RCU implementations. Not for manual selection.

config FORCE_TASKS_RCU
	bool "Force selection of TASKS_RCU"
	depends on RCU_EXPERT
	select TASKS_RCU
	default n
	help
	  This option force-enables a task-based RCU implementation
	  that uses only voluntary context switch (not preemption!),
	  idle, and user-mode execution as quiescent states. Not for
	  manual selection in most cases.

config NEED_TASKS_RCU
	bool
	default n

config TASKS_RCU
	bool
	default NEED_TASKS_RCU && (PREEMPTION || PREEMPT_AUTO)
	select IRQ_WORK

config FORCE_TASKS_RUDE_RCU
	bool "Force selection of Tasks Rude RCU"
	depends on RCU_EXPERT
	select TASKS_RUDE_RCU
	default n
	help
	  This option force-enables a task-based RCU implementation
	  that uses only context switch (including preemption) and
	  user-mode execution as quiescent states. It forces IPIs and
	  context switches on all online CPUs, including idle ones,
	  so use with caution. Not for manual selection in most cases.

config TASKS_RUDE_RCU
	bool
	default n
	select IRQ_WORK

config FORCE_TASKS_TRACE_RCU
	bool "Force selection of Tasks Trace RCU"
	depends on RCU_EXPERT
	select TASKS_TRACE_RCU
	default n
	help
	  This option enables a task-based RCU implementation that uses
	  explicit rcu_read_lock_trace() read-side markers, and allows
	  these readers to appear in the idle loop as well as on the
	  CPU hotplug code paths. It can force IPIs on online CPUs,
	  including idle ones, so use with caution. Not for manual
	  selection in most cases.

config TASKS_TRACE_RCU
	bool
	default n
	select IRQ_WORK

config RCU_STALL_COMMON
	def_bool TREE_RCU
	help
	  This option enables RCU CPU stall code that is common between
	  the TINY and TREE variants of RCU. The purpose is to allow
	  the tiny variants to disable RCU CPU stall warnings, while
	  making these warnings mandatory for the tree variants.

config RCU_NEED_SEGCBLIST
	def_bool ( TREE_RCU || TREE_SRCU || TASKS_RCU_GENERIC )

config RCU_FANOUT
	int "Tree-based hierarchical RCU fanout value"
	range 2 64 if 64BIT
	range 2 32 if !64BIT
	depends on TREE_RCU && RCU_EXPERT
	default 64 if 64BIT
	default 32 if !64BIT
	help
	  This option controls the fanout of hierarchical implementations
	  of RCU, allowing RCU to work efficiently on machines with
	  large numbers of CPUs. This value must be at least the fourth
	  root of NR_CPUS, which allows NR_CPUS to be insanely large.
	  The default value of RCU_FANOUT should be used for production
	  systems, but if you are stress-testing the RCU implementation
	  itself, small RCU_FANOUT values allow you to test large-system
	  code paths on small(er) systems.

	  Select a specific number if testing RCU itself.
	  Take the default if unsure.

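# Worked example of the fourth-root bound stated in the RCU_FANOUT help
# text (illustrative arithmetic only, not part of the upstream file).
# Assuming NR_CPUS=4096, the fourth root is 8 because 8^4 = 4096, so
# RCU_FANOUT must be at least 8. A hypothetical stress-test .config
# fragment that meets the bound might look like:
#
#   CONFIG_RCU_EXPERT=y
#   CONFIG_NR_CPUS=4096
#   CONFIG_RCU_FANOUT=8
#
# Taking the same bound at face value, the 64-bit default of 64 covers
# any NR_CPUS up to 64^4 = 16,777,216.
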
config RCU_FANOUT_LEAF
	int "Tree-based hierarchical RCU leaf-level fanout value"
	range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
	range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
	range 2 3 if RCU_STRICT_GRACE_PERIOD
	depends on TREE_RCU && RCU_EXPERT
	default 16 if !RCU_STRICT_GRACE_PERIOD
	default 2 if RCU_STRICT_GRACE_PERIOD
	help
	  This option controls the leaf-level fanout of hierarchical
	  implementations of RCU, and allows trading off cache misses
	  against lock contention. Systems that synchronize their
	  scheduling-clock interrupts for energy-efficiency reasons will
	  want the default because the smaller leaf-level fanout keeps
	  lock contention levels acceptably low. Very large systems
	  (hundreds or thousands of CPUs) will instead want to set this
	  value to the maximum value possible in order to reduce the
	  number of cache misses incurred during RCU's grace-period
	  initialization. These systems tend to run CPU-bound, and thus
	  are not helped by synchronized interrupts, and thus tend to
	  skew them, which reduces lock contention enough that large
	  leaf-level fanouts work well. That said, setting leaf-level
	  fanout to a large number will likely cause problematic
	  lock contention on the leaf-level rcu_node structures unless
	  you boot with the skew_tick kernel parameter.

	  Select a specific number if testing RCU itself.

	  Select the maximum permissible value for large systems, but
	  please understand that you may also need to set the skew_tick
	  kernel boot parameter to avoid contention on the rcu_node
	  structure's locks.

	  Take the default if unsure.

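# Sketch of the large-system setup that the RCU_FANOUT_LEAF help text
# describes (an assumed example, not a recommendation and not part of the
# upstream file). On a 64BIT kernel built without RCU_STRICT_GRACE_PERIOD,
# the maximum permitted leaf fanout is 64:
#
#   CONFIG_RCU_EXPERT=y
#   CONFIG_RCU_FANOUT_LEAF=64
#
# The help text pairs this with the skew_tick boot parameter, so the
# kernel command line would also carry:
#
#   skew_tick=1
#
# Whether the larger fanout actually helps depends on the workload, so
# the default of 16 remains the safer starting point.
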
config RCU_BOOST
	bool "Enable RCU priority boosting"
	depends on (RT_MUTEXES && PREEMPT_RCU && RCU_EXPERT) || PREEMPT_RT
	default y if PREEMPT_RT
	help
	  This option boosts the priority of preempted RCU readers that
	  block the current preemptible RCU grace period for too long.
	  This option also prevents heavy loads from blocking RCU
	  callback invocation.

	  Say Y here if you are working with real-time apps or heavy loads.
	  Say N here if you are unsure.

config RCU_BOOST_DELAY
	int "Milliseconds to delay boosting after RCU grace-period start"
	range 0 3000
	depends on RCU_BOOST
	default 500
	help
	  This option specifies the time to wait after the beginning of
	  a given grace period before priority-boosting preempted RCU
	  readers blocking that grace period. Note that any RCU reader
	  blocking an expedited RCU grace period is boosted immediately.

	  Accept the default if unsure.

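# Illustrative .config fragment for the two boosting options above (the
# 200 ms value is an assumption chosen for the example; it is not an
# upstream recommendation). It presumes a non-PREEMPT_RT kernel where
# RT_MUTEXES and PREEMPT_RCU are already enabled, so RCU_EXPERT is what
# makes RCU_BOOST selectable:
#
#   CONFIG_RCU_EXPERT=y
#   CONFIG_RCU_BOOST=y
#   CONFIG_RCU_BOOST_DELAY=200
#
# Any value from 0 to 3000 is accepted. Readers blocking an expedited
# grace period are boosted immediately regardless of this delay.
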
rcu: Move expedited grace period (GP) work to RT kthread_worker
Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
latency because its workqueues run at SCHED_OTHER, and thus can be
delayed by normal processes. This commit avoids these delays by moving
the expedited GP work items to a real-time-priority kthread_worker.
This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by
default on PREEMPT_RT=y kernels which disable expedited grace periods
after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1.
The results were evaluated on arm64 Android devices (6GB ram) running
5.10 kernel, and capturing trace data in critical user-level code.
The table below shows the resulting order-of-magnitude improvements
in synchronize_rcu_expedited() latency:
------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Count | 725 | 688 | |
------------------------------------------------------------------------
| Min Duration (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Q1 (ns) | 39,428 | 38,971 | -1.16% |
------------------------------------------------------------------------
| Q2 - Median (ns) | 98,225 | 69,743 | -29.00% |
------------------------------------------------------------------------
| Q3 (ns) | 342,122 | 126,638 | -62.98% |
------------------------------------------------------------------------
| Max Duration (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Avg Duration (ns) | 2,746,353 | 151,242 | -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns) | 19,327,765 | 294,408 | |
------------------------------------------------------------------------
The table below shows the range of maximums/minimums for
synchronize_rcu_expedited() latency from all experiments:
------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Total No. of Experiments | 25 | 23 | |
------------------------------------------------------------------------
| Largest Maximum (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Smallest Maximum (ns) | 38,819 | 86,954 | 124.00% |
------------------------------------------------------------------------
| Range of Maximums (ns) | 372,728,148 | 2,242,717 | |
------------------------------------------------------------------------
| Largest Minimum (ns) | 88,623 | 27,588 | -68.87% |
------------------------------------------------------------------------
| Smallest Minimum (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Range of Minimums (ns) | 88,297 | 27,141 | |
------------------------------------------------------------------------
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Wei Wang <wvw@google.com>
Tested-by: Kyle Lin <kylelin@google.com>
Tested-by: Chunwei Lu <chunweilu@google.com>
Tested-by: Lulu Wang <luluw@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-09 08:35:27 +08:00
config RCU_EXP_KTHREAD
	bool "Perform RCU expedited work in a real-time kthread"
	depends on RCU_BOOST && RCU_EXPERT
	default !PREEMPT_RT && NR_CPUS <= 32
	help
	  Use this option to further reduce the latencies of expedited
	  grace periods at the expense of being more disruptive.

	  This option is disabled by default on PREEMPT_RT=y kernels which
	  disable expedited grace periods after boot by unconditionally
	  setting rcupdate.rcu_normal_after_boot=1.

	  Accept the default if unsure.

config RCU_NOCB_CPU
	bool "Offload RCU callback processing from boot-selected CPUs"
	depends on TREE_RCU
	depends on RCU_EXPERT || NO_HZ_FULL
	default n
	help
	  Use this option to reduce OS jitter for aggressive HPC or
	  real-time workloads. It can also be used to offload RCU
	  callback invocation to energy-efficient CPUs in battery-powered
	  asymmetric multiprocessors. The price of this reduced jitter
	  is that the overhead of call_rcu() increases and that some
	  workloads will incur significant increases in context-switch
	  rates.

	  This option offloads callback invocation from the set of CPUs
	  specified at boot time by the rcu_nocbs parameter. For each
	  such CPU, a kthread ("rcuox/N") will be created to invoke
	  callbacks, where the "N" is the CPU being offloaded, and where
	  the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for
	  RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread
	  from running on the specified CPUs, but (1) the kthreads may be
	  preempted between each callback, and (2) affinity or cgroups can
	  be used to force the kthreads to run on whatever set of CPUs is
	  desired.

	  Say Y here if you need reduced OS jitter, despite added overhead.
	  Say N here if you are unsure.

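# Hypothetical example of the offloading setup described in the
# RCU_NOCB_CPU help text (the CPU list is an assumption made for
# illustration; this comment is not part of the upstream file):
#
#   CONFIG_RCU_EXPERT=y
#   CONFIG_RCU_NOCB_CPU=y
#
# with the kernel command line carrying:
#
#   rcu_nocbs=1-7
#
# Following the naming scheme in the help text, a PREEMPTION kernel would
# then run kthreads "rcuop/1" through "rcuop/7", while CPU 0 continues to
# invoke its own callbacks.
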
config RCU_NOCB_CPU_DEFAULT_ALL
	bool "Offload RCU callback processing from all CPUs by default"
	depends on RCU_NOCB_CPU
	default n
	help
	  Use this option to offload callback processing from all CPUs
	  by default, in the absence of the rcu_nocbs or nohz_full boot
	  parameter. This also avoids the need to use any boot parameters
	  to achieve the effect of offloading all CPUs on boot.

	  Say Y here if you want to offload all CPUs by default on boot.
	  Say N here if you are unsure.

rcu/nocb: Add option to opt rcuo kthreads out of RT priority
This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that
prevents rcuo kthreads from running at real-time priority, even in
kernels built with RCU_BOOST. This capability is important to devices
needing low-latency (as in a few milliseconds) response from expedited
RCU grace periods, but which are not running a classic real-time workload.
On such devices, permitting the rcuo kthreads to run at real-time priority
results in unacceptable latencies imposed on the application tasks,
which run as SCHED_OTHER.
See for example the following trace output:
<snip>
<...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270
<snip>
If that rcuop kthread were permitted to run at real-time SCHED_FIFO
priority, it would monopolize its CPU for hundreds of milliseconds
while invoking those 34619 RCU callback functions, which would cause an
unacceptably long latency spike for many application stacks on Android
platforms.
However, some existing real-time workloads require that callback
invocation run at SCHED_FIFO priority, for example, those running on
systems with heavy SCHED_OTHER background loads. (It is the real-time
system's administrator's responsibility to make sure that important
real-time tasks run at a higher priority than do RCU's kthreads.)
Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to
"y" on kernels built with PREEMPT_RT and defaults to "n" otherwise.
The effect is to preserve current behavior for real-time systems, but for
other systems to allow expedited RCU grace periods to run with real-time
priority while continuing to invoke RCU callbacks as SCHED_OTHER.
As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no
effect except on CPUs with offloaded RCU callbacks.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-05-11 16:57:03 +08:00
config RCU_NOCB_CPU_CB_BOOST
	bool "Offload RCU callback from real-time kthread"
	depends on RCU_NOCB_CPU && RCU_BOOST
	default y if PREEMPT_RT
	help
	  Use this option to invoke offloaded callbacks as SCHED_FIFO
	  to avoid starvation by heavy SCHED_OTHER background load.
	  Of course, running as SCHED_FIFO during callback floods will
	  cause the rcuo[ps] kthreads to monopolize the CPU for hundreds
	  of milliseconds or more. Therefore, when enabling this option,
	  it is your responsibility to ensure that latency-sensitive
	  tasks either run with higher priority or run on some other CPU.

	  Say Y here if you want to set RT priority for offloading kthreads.
	  Say N here if you are building a !PREEMPT_RT kernel and are unsure.

config TASKS_TRACE_RCU_READ_MB
	bool "Tasks Trace RCU readers use memory barriers in user and idle"
	depends on RCU_EXPERT && TASKS_TRACE_RCU
	default PREEMPT_RT || NR_CPUS < 8
	help
	  Use this option to further reduce the number of IPIs sent
	  to CPUs executing in userspace or idle during tasks trace
	  RCU grace periods. Given that a reasonable setting of
	  the rcupdate.rcu_task_ipi_delay kernel boot parameter
	  eliminates such IPIs for many workloads, proper setting
	  of this Kconfig option is important mostly for aggressive
	  real-time installations and for battery-powered devices,
	  hence the default chosen above.

	  Say Y here if you hate IPIs.
	  Say N here if you hate read-side memory barriers.
	  Take the default if you are unsure.

config RCU_LAZY
	bool "RCU callback lazy invocation functionality"
	depends on RCU_NOCB_CPU
	default n
	help
	  To save power, batch RCU callbacks and flush after delay, memory
	  pressure, or callback list growing too big.

	  Requires rcu_nocbs=all to be set.

	  Use rcutree.enable_rcu_lazy=0 to turn it off at boot time.

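# Illustrative fragment combining the requirements named in the RCU_LAZY
# help text (a sketch under the assumption that RCU_EXPERT is used to
# satisfy the RCU_NOCB_CPU dependency; not part of the upstream file):
#
#   CONFIG_RCU_EXPERT=y
#   CONFIG_RCU_NOCB_CPU=y
#   CONFIG_RCU_LAZY=y
#
# Kernel command line with lazy invocation active:
#
#   rcu_nocbs=all
#
# Kernel command line that keeps the same build but turns laziness off:
#
#   rcu_nocbs=all rcutree.enable_rcu_lazy=0
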
config RCU_LAZY_DEFAULT_OFF
	bool "Turn RCU lazy invocation off by default"
	depends on RCU_LAZY
	default n
	help
	  Allows building the kernel with CONFIG_RCU_LAZY=y while keeping
	  lazy invocation off by default. The boot-time parameter
	  rcutree.enable_rcu_lazy=1 can be used to switch it back on.

config RCU_DOUBLE_CHECK_CB_TIME
	bool "RCU callback-batch backup time check"
	depends on RCU_EXPERT
	default n
	help
	  Use this option to provide more precise enforcement of the
	  rcutree.rcu_resched_ns module parameter in situations where
	  a single RCU callback might run for hundreds of microseconds,
	  thus defeating the 32-callback batching used to amortize the
	  cost of the fine-grained but expensive local_clock() function.

	  This option rounds rcutree.rcu_resched_ns up to the next
	  jiffy, and overrides the 32-callback batching if this limit
	  is exceeded.

	  Say Y here if you need tighter callback-limit enforcement.
	  Say N here if you are unsure.

endmenu # "RCU Subsystem"