The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING. Reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds the start time, in jiffies, of the most recently started
rcu_barrier_tasks*() operation to the diagnostic output used by rcuscale.
This information can be helpful in distinguishing a hung barrier operation
from a long series of barrier operations.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds rcu_tasks_torture_stats_print(),
rcu_tasks_trace_torture_stats_print(), and
rcu_tasks_rude_torture_stats_print() functions that provide detailed
diagnostics on grace-period, callback, and barrier state.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Each Tasks RCU flavor keeps a count of the number of callbacks that the
current rcu_barrier_tasks*() is waiting on, but there is currently no
easy way to work out which callback is stuck. One way to do this is to
mark idle RCU-barrier callbacks by making the ->next pointer point to
the callback itself, and this commit does just that.
Later commits will use this for debug output.
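As a rough userspace illustration of the self-pointing ->next convention,
the following sketch tells an idle callback apart from a queued one. The
structure and helper names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a barrier callback header. */
struct cb_head {
	struct cb_head *next;
};

/* Mark a callback as idle by pointing ->next at the callback itself. */
void cb_mark_idle(struct cb_head *cb)
{
	cb->next = cb;
}

/* A callback is idle exactly when ->next points back at it. */
bool cb_is_idle(const struct cb_head *cb)
{
	return cb->next == cb;
}

/* Push an idle callback onto a singly linked list, clearing the marker. */
void cb_enqueue(struct cb_head **list, struct cb_head *cb)
{
	assert(cb_is_idle(cb));	/* only idle callbacks may be queued */
	cb->next = *list;	/* NULL or another element, never cb itself */
	*list = cb;
}
```

A debug dump could then walk the per-CPU callbacks and report those for
which cb_is_idle() returns false as still being waited on.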
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rtp->tasks_gp_seq grace-period sequence number is not a strict count,
but rather the usual RCU sequence number with the lower few bits tracking
per-grace-period state and the upper bits the count of grace periods
since boot, give or take the initial value. This commit therefore
adjusts this comment.
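The layout described above can be sketched in userspace as follows. The
two-bit state field and all of the names are assumptions made for this
illustration, not a copy of the kernel's definitions.

```c
#include <assert.h>

/* Assumed layout for this sketch: two low-order state bits. */
#define SEQ_CTR_SHIFT	2
#define SEQ_STATE_MASK	((1UL << SEQ_CTR_SHIFT) - 1)

/* Upper bits: count of grace periods, give or take the initial value. */
unsigned long seq_ctr(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT;
}

/* Lower bits: state within the current grace period (0 means idle). */
unsigned long seq_state(unsigned long s)
{
	return s & SEQ_STATE_MASK;
}

/* Begin a grace period: the count is unchanged, the state becomes nonzero. */
unsigned long seq_start(unsigned long s)
{
	assert(seq_state(s) == 0);
	return s + 1;
}

/* End a grace period: advance to the next count with the state cleared. */
unsigned long seq_end(unsigned long s)
{
	assert(seq_state(s) != 0);
	return (s | SEQ_STATE_MASK) + 1;
}
```

Because the state bits share the word with the count, the raw value
advances by more than one per grace period, which is why it cannot be
read as a strict count.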
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The current mapping of smp_processor_id() to a CPU processing Tasks-RCU
callbacks makes some assumptions about layout. This commit therefore
adds a WARN_ON() to check these assumptions.
[ neeraj.upadhyay: Replace nr_cpu_ids with rcu_task_cpu_ids. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
For kernels built with CONFIG_FORCE_NR_CPUS=y, nr_cpu_ids is defined
as NR_CPUS instead of the number of possible CPUs. This causes the
following system panic:
smpboot: Allowing 4 CPUs, 0 hotplug CPUs
...
setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:512 nr_node_ids:1
...
BUG: unable to handle page fault for address: ffffffff9911c8c8
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15 Comm: rcu_tasks_trace Tainted: G W
6.6.21 #1 5dc7acf91a5e8e9ac9dcfc35bee0245691283ea6
RIP: 0010:rcu_tasks_need_gpcb+0x25d/0x2c0
RSP: 0018:ffffa371c00a3e60 EFLAGS: 00010082
CR2: ffffffff9911c8c8 CR3: 000000040fa20005 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x23/0x80
? page_fault_oops+0xa4/0x180
? exc_page_fault+0x152/0x180
? asm_exc_page_fault+0x26/0x40
? rcu_tasks_need_gpcb+0x25d/0x2c0
? __pfx_rcu_tasks_kthread+0x40/0x40
rcu_tasks_one_gp+0x69/0x180
rcu_tasks_kthread+0x94/0xc0
kthread+0xe8/0x140
? __pfx_kthread+0x40/0x40
ret_from_fork+0x34/0x80
? __pfx_kthread+0x40/0x40
ret_from_fork_asm+0x1b/0x80
</TASK>
Considering that there may be holes in the CPU numbers, use the
maximum possible cpu number, instead of nr_cpu_ids, for configuring
enqueue and dequeue limits.
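A minimal userspace sketch of the idea, assuming a 64-bit mask stands in
for cpu_possible_mask. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return one more than the highest CPU number set in the mask, or 0 if
 * the mask is empty.  Holes below the highest set bit do not matter.
 */
unsigned int max_possible_cpu_ids(uint64_t possible_mask)
{
	unsigned int ids = 0;

	while (possible_mask) {
		ids++;
		possible_mask >>= 1;
	}
	return ids;
}

/* Clamp an enqueue/dequeue limit to the CPUs that can actually exist. */
unsigned int clamp_queue_lim(unsigned int lim, uint64_t possible_mask)
{
	unsigned int ids = max_possible_cpu_ids(possible_mask);

	assert(ids > 0);	/* at least the boot CPU is possible */
	return lim < ids ? lim : ids;
}
```

With four possible CPUs and a build-time maximum of 512, the limit comes
out as 4 rather than 512, so per-CPU data for nonexistent CPUs is never
touched.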
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Closes: https://lore.kernel.org/linux-input/CALMA0xaTSMN+p4xUXkzrtR5r6k7hgoswcaXx7baR_z9r5jjskw@mail.gmail.com/T/#u
Reported-by: Zhixu Liu <zhixu.liu@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. This commit therefore removes their definitions and boot-time
self-tests.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
on all online CPUs, no explicit ordering synchronizes properly with a
context switch. This lack of ordering can permit the new task to miss
pre-grace-period update-side accesses. The following diagram, courtesy
of Paul, shows the possible bad scenario:
CPU 0                                   CPU 1
-----                                   -----
// Pre-GP update side access
WRITE_ONCE(*X, 1);
smp_mb();
r0 = rq->curr;
                                        RCU_INIT_POINTER(rq->curr, TASK_B)
                                        spin_unlock(rq)
                                        rcu_read_lock_trace()
                                        r1 = X;
/* ignore TASK_B */
Either r0==TASK_B or r1==1 is needed but neither is guaranteed.
One possible solution is to wait for an RCU grace period at the
beginning of the RCU-tasks-trace grace period before taking the
current-task snapshot. However, this would add large latencies to
RCU-tasks-trace grace periods.
Another solution is to lock the target runqueue while taking the current
task snapshot. This ensures that the update side sees the latest context
switch and subsequent context switches will see the pre-grace-period
update side accesses.
This commit therefore adds runqueue locking to cpu_curr_snapshot().
Fixes: e386b67257 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This reverts commit 28319d6dc5. The race
it fixed was subject to conditions that don't exist anymore since:
1612160b91 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks")
This latter commit removes the use of SRCU that used to cover the
RCU-tasks blind spot on exit between the tasklist's removal and the
final preemption disabling. The task is now placed instead into a
temporary list inside which voluntary sleeps are accounted as RCU-tasks
quiescent states. This would disarm the deadlock initially reported
against PID namespace exit.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit makes the rcu-tasks-related rcutorture tests print the
rcu-tasks grace-period state when a writer stall occurs or at the end
of the rcutorture test. It also introduces the rcu_ops->get_gp_data()
operation to simplify acquiring grace-period state for the different
types of rcutorture tests.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
There is a possibility of buffer overflow in
show_rcu_tasks_trace_gp_kthread() if the counters passed
to sprintf() are huge. The counter values needed for this
are unrealistically high, but buffer overflow is still
possible.
Use snprintf() with buffer size instead of sprintf().
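The difference can be seen in a small userspace sketch. The format
string and function name below are hypothetical and are not taken from
the kernel function being fixed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format three counters into buf without ever writing past size bytes.
 * Returns 1 if the text fit, 0 if snprintf() had to truncate it.
 */
int format_counters(char *buf, size_t size,
		    unsigned long n1, unsigned long n2, unsigned long n3)
{
	int len;

	assert(size > 0);
	len = snprintf(buf, size, "a:%lu b:%lu c:%lu", n1, n2, n3);
	return len >= 0 && (size_t)len < size;
}
```

With a 16-byte buffer, small counters fit, while very large ones are
truncated to 15 characters plus the terminating NUL instead of running
off the end of the buffer as sprintf() would.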
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: edf3775f0a ("rcu-tasks: Add count for idle tasks on offline CPUs")
Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
The synchronize_srcu() call has been removed from rcu_tasks_postscan()
by commit ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU
tasks").
This commit therefore fixes the tasks_rcu_exit_srcu_stall_timer comment.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Because the Tasks RCU ->rtp_exit_list is initialized at rcu_init()
time while there is only one CPU running with interrupts disabled, it
is not possible for an exiting task to encounter an uninitialized list.
This commit therefore replaces the conditional initialization with
a WARN_ON_ONCE().
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/all/ZdiNXmO3wRvmzPsr@lothringen/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tasks Trace RCU needs a single-byte cmpxchg(), but no such thing exists.
Therefore, rcu_trc_cmpxchg_need_qs() emulates one using field substitution
and a four-byte cmpxchg(), such that the other three bytes are always
atomically updated to their old values. This works, but results in
false-positive KCSAN failures because as far as KCSAN knows, this
cmpxchg() operation is updating all four bytes.
This commit therefore encloses the cmpxchg() in a data_race() and adds
a single-byte instrument_atomic_read_write(), thus telling KCSAN exactly
what is going on so as to avoid the false positives.
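The emulation itself, without the KCSAN annotations, can be sketched in
userspace with C11 atomics. This is an illustration under assumptions:
the function name is hypothetical, and "lane" selects a byte by its
position within the integer value rather than by memory address, which
sidesteps endianness.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Emulate a compare-and-exchange on one 8-bit lane of a 32-bit word.
 * The other three lanes are written back with the values just read, so
 * the four-byte operation fails if any of them changed in the meantime.
 * Returns the lane's value as observed by this attempt.
 */
uint8_t lane_cmpxchg(_Atomic uint32_t *word, unsigned int lane,
		     uint8_t old, uint8_t new)
{
	unsigned int shift = lane * 8;
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t cur = atomic_load(word);
	uint32_t want;

	assert(lane < 4);
	if ((uint8_t)((cur & mask) >> shift) != old)
		return (uint8_t)((cur & mask) >> shift);
	want = (cur & ~mask) | ((uint32_t)new << shift);
	/* On failure, cur is updated to the value actually found. */
	atomic_compare_exchange_strong(word, &cur, want);
	return (uint8_t)((cur & mask) >> shift);
}
```

As with an ordinary cmpxchg(), a return value equal to "old" means the
update took effect. A race detector that sees only the four-byte
operation cannot know that three of those bytes are logically untouched,
which is the source of the false positives described above.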
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Currently, all waits for grace periods sleep at TASK_UNINTERRUPTIBLE,
regardless of RCU flavor. This has worked well, but there have been
cases where a longer-than-average Tasks RCU grace period has triggered
softlockup splats, many of them, before the Tasks RCU CPU stall warning
appears. These softlockup splats unnecessarily consume console bandwidth
and complicate diagnosis of the underlying problem. Plus a long but not
pathologically long Tasks RCU grace period might trigger a few softlockup
splats before completing normally, which generates noise for no good
reason.
This commit therefore causes Tasks RCU grace periods to sleep at TASK_IDLE
priority. If there really is a persistent problem, the eventual Tasks
RCU CPU stall warning will flag it, and without the extra noise.
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
The current code will scan the entirety of each per-CPU list of exiting
tasks in ->rtp_exit_list with interrupts disabled. This is normally just
fine, because each CPU typically won't have very many tasks in this state.
However, if a large number of tasks block late in do_exit(), these lists
could be arbitrarily long. Low probability, perhaps, but it really
could happen.
This commit therefore occasionally re-enables interrupts while traversing
these lists, inserting a dummy element to hold the current place in the
list. In kernels built with CONFIG_PREEMPT_RT=y, this re-enabling happens
after each list element is processed, otherwise every one-to-two jiffies.
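The placeholder technique can be sketched in userspace as follows. This
is a simplified illustration: the names are hypothetical, a batch count
stands in for the one-to-two-jiffy time check, and the pause callback
stands in for re-enabling and then re-disabling interrupts.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
	int visited;
};

void list_init(struct node *head)
{
	head->prev = head->next = head;
}

void list_add_after(struct node *pos, struct node *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Example pause callback: just counts how often it was invoked. */
void count_pause(void *arg)
{
	++*(int *)arg;
}

/*
 * Visit every element, pausing after each batch of elements.  A dummy
 * node parked after the current element holds the place, so elements
 * may be added or removed during the pause.  Returns the pause count.
 */
int scan_with_pauses(struct node *head, int batch,
		     void (*pause_fn)(void *), void *arg)
{
	struct node dummy;
	struct node *pos = head->next;
	int since_pause = 0, pauses = 0;

	assert(batch > 0);
	while (pos != head) {
		pos->visited = 1;
		if (++since_pause >= batch && pos->next != head) {
			list_add_after(pos, &dummy);	/* hold our place */
			pause_fn(arg);
			pos = dummy.next;	/* resume after the dummy */
			list_del(&dummy);
			since_pause = 0;
			pauses++;
			continue;
		}
		pos = pos->next;
	}
	return pauses;
}
```

A batch size of one corresponds to the CONFIG_PREEMPT_RT=y behavior of
pausing after every element.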
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/ZdeI_-RfdLR8jlsm@localhost.localdomain/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.
This commit therefore eliminates these deadlocks by replacing the
SRCU-based wait for do_exit() completion with per-CPU lists of tasks
currently exiting. A given task will be on one of these per-CPU lists for
the same period of time that this task would previously have been in the
previous SRCU read-side critical section. These lists enable RCU Tasks
to find the tasks that have already been removed from the tasks list,
but that must nevertheless be waited upon.
The RCU Tasks grace period gathers any of these do_exit() tasks that it
must wait on, and adds them to the list of holdouts. Per-CPU locking
and get_task_struct() are used to synchronize addition to and removal
from these lists.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
This commit continues the elimination of deadlocks involving do_exit()
and RCU tasks by causing exit_tasks_rcu_start() to add the current
task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the
current task from whatever list it is on. These lists will be used to
track tasks that are exiting, while still accounting for any RCU-tasks
quiescent states that these tasks pass through.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.
This commit therefore initializes the data structures that will be needed
to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
In order for RCU Tasks to reliably maintain per-CPU lists of exiting
tasks, those lists must be initialized before it is possible for tasks
to exit, especially given that the boot CPU is not necessarily CPU 0
(for example, in powerpc kexec() kernels). And at the time that
rcu_init_tasks_generic() is called, a task could potentially exit,
unconventional though that sort of thing might be.
This commit therefore moves the calls to cblist_init_generic() from
functions called from rcu_init_tasks_generic() to a new function named
tasks_cblist_init_generic() that is invoked from rcu_init().
This constituted a bug in a commit that never went to mainline, so
there is no need for any backporting to -stable.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.
This commit therefore adds the data structures that will be needed
to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
The task_struct structure's ->rcu_tasks_idle_cpu can be concurrently
read and written from the RCU Tasks grace-period kthread and from the
CPU on which the task_struct structure's task is running. This commit
therefore marks the accesses appropriately.
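What "marking" means can be sketched in userspace with volatile
accesses. The macros below are int-only stand-ins for the kernel's
READ_ONCE() and WRITE_ONCE(), and the reduced structure and the use of
-1 for "not idle" are assumptions of this sketch.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel's marked-access macros (int only). */
#define READ_ONCE_INT(x)	(*(const volatile int *)&(x))
#define WRITE_ONCE_INT(x, v)	(*(volatile int *)&(x) = (v))

/* Hypothetical reduced task structure holding only the field of interest. */
struct task_sketch {
	int idle_cpu;	/* plays the role of ->rcu_tasks_idle_cpu */
};

/* Writer side, running on the task's own CPU. */
void task_note_idle_cpu(struct task_sketch *t, int cpu)
{
	assert(cpu >= -1);	/* -1 is taken to mean "not idle" here */
	WRITE_ONCE_INT(t->idle_cpu, cpu);
}

/* Reader side, as a grace-period kthread might sample the field. */
int task_sample_idle_cpu(struct task_sketch *t)
{
	return READ_ONCE_INT(t->idle_cpu);
}
```

The marked accesses keep the compiler from tearing, fusing, or
re-reading the field, and they document that the concurrency is
intentional.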
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
The commit:
cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that the idle task of an offline CPU may not carry the
PF_IDLE flag anymore.
However RCU-tasks-trace tests the opposite assertion, still assuming
that idle tasks carry the PF_IDLE flag during their whole lifecycle.
Remove this assumption to avoid spurious warnings but keep the initial
test verifying that the idle task is the current task on any offline
CPU.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The commit:
cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that CPU boot code preceding the actual idle loop is excluded
from it.
This has however introduced new potential RCU-tasks stalls when either:
1) Grace period is started before init/0 had a chance to set PF_IDLE,
keeping it stuck in the holdout list until idle ever schedules.
2) Grace period is started when some possible CPUs have never been
online, keeping their idle tasks stuck in the holdout list until the
CPU ever boots up.
3) Similar to 1) but with secondary CPUs: Grace period is started
concurrently with secondary CPU booting, putting its idle task in
the holdout list because PF_IDLE isn't yet observed on it. It stays
then stuck in the holdout list until that CPU ever schedules. The
effect is mitigated here by the hotplug AP thread that must run to
bring the CPU up.
Fix this by handling the new semantics of PF_IDLE, keeping in mind
that it may or may not be set on an idle task. Take advantage of that to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it. Only the code running within
the idle loop is now a quiescent state, along with offline CPUs.
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.
And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:
Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
(rcu_do_batch) from (rcu_core+0x1d4/0x284)
(rcu_core) from (__do_softirq+0x24c/0x344)
(__do_softirq) from (__irq_exit_rcu+0x64/0x108)
(__irq_exit_rcu) from (irq_exit+0x8/0x10)
(irq_exit) from (__handle_domain_irq+0x74/0x9c)
(__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
(gic_handle_irq) from (__irq_svc+0x5c/0x94)
(__irq_svc) from (arch_cpu_idle+0x20/0x3c)
(arch_cpu_idle) from (default_idle_call+0x4c/0x78)
(default_idle_call) from (do_idle+0xf8/0x150)
(do_idle) from (cpu_startup_entry+0x18/0x20)
(cpu_startup_entry) from (0xc01530)
This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:
slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256
This provides the rough size of the memory block and the offset of the
rcu_head structure, which at least provides a few clues to help
locate the problem. If the problem is reproducible, additional slab
debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
provide significantly more information.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The rcu_tasks_lazy_ms variable is not used outside the file tasks.h,
so this commit marks it static.
kernel/rcu/tasks.h:1085:5: warning: symbol 'rcu_tasks_lazy_ms' was not declared. Should it be static?
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6086
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The rcu_tasks_need_gpcb() samples ->percpu_dequeue_lim as part of the
condition clause of a "for" loop, which is a bit confusing. This commit
therefore hoists this sampling out of the loop, using the result loaded
in the condition clause.
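The shape of the change can be sketched in userspace as follows. The
structure and function are hypothetical, and a volatile field stands in
for one that may change concurrently.

```c
#include <assert.h>

/* Hypothetical stand-in for the structure holding the dequeue limit. */
struct queue_set {
	volatile int dequeue_lim;	/* may change concurrently */
	int ncbs[8];			/* per-queue callback counts */
};

/* Sum callbacks over the queues below a limit sampled exactly once. */
int count_cbs(struct queue_set *qs)
{
	int lim = qs->dequeue_lim;	/* hoisted: one load, before the loop */
	int total = 0;

	assert(lim >= 0 && lim <= 8);
	for (int cpu = 0; cpu < lim; cpu++)
		total += qs->ncbs[cpu];
	return total;
}
```

With the load hoisted, every iteration compares against the same bound,
so a reader need not wonder what happens if the limit changes in the
middle of the loop.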
So why does this work in the face of a concurrent switch from single-CPU
queueing to per-CPU queueing?
o The call_rcu_tasks_generic() that makes the change has already
enqueued its callback, which means that all of the other CPUs'
callback queues are empty.
o For the call_rcu_tasks_generic() that first notices
the switch to per-CPU queues, the smp_store_release()
used to update ->percpu_enqueue_lim pairs with the
raw_spin_trylock_rcu_node()'s full barrier that is
between the READ_ONCE(rtp->percpu_enqueue_shift) and the
rcu_segcblist_enqueue() that enqueues the callback.
o Because this CPU's queue is empty (unless it happens to
be the original single queue, in which case there is no
need for synchronization), this call_rcu_tasks_generic()
will do an irq_work_queue() to schedule a handler for the
needed rcuwait_wake_up() call. This call will be ordered
after the first call_rcu_tasks_generic() function's change to
->percpu_dequeue_lim.
o This rcuwait_wake_up() will either happen before or after the
set_current_state() in rcuwait_wait_event(). If it happens
before, the "condition" argument's call to rcu_tasks_need_gpcb()
will be ordered after the original change, and all callbacks on
all CPUs will be visible. Otherwise, if it happens after, then
the grace-period kthread's state will be set back to running,
which will result in a later call to rcuwait_wait_event() and
thus to rcu_tasks_need_gpcb(), which will again see the change.
So it all works out.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Currently, rcu_tasks_initiate_self_tests() prints a message and then
initiates self tests on up to three different RCU Tasks flavors. If one
of the flavors has a grace-period hang, it is not easy to work out which
of the three hung. This commit therefore prints a message prior to each
individual test.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
In kernels built with CONFIG_PROVE_RCU=y (for example, lockdep kernels),
the following sequence of events can occur:
o rcu_init_tasks_generic() is invoked just before init is spawned.
It invokes rcu_spawn_tasks_kthread() and friends.
o rcu_spawn_tasks_kthread() invokes rcu_spawn_tasks_kthread_generic(),
which uses kthread_run() to create the needed kthread.
o Control returns to rcu_init_tasks_generic(), which, because this
is a CONFIG_PROVE_RCU=y kernel, invokes the version of the
rcu_tasks_initiate_self_tests() function that actually does
something, including invoking synchronize_rcu_tasks(), which
in turn invokes synchronize_rcu_tasks_generic().
o synchronize_rcu_tasks_generic() sees that the ->kthread_ptr is
still NULL, because the newly spawned kthread has not yet
started.
o The new kthread starts, preempting synchronize_rcu_tasks_generic()
just after its check. This kthread invokes rcu_tasks_one_gp(),
which acquires ->tasks_gp_mutex, and, seeing no work, blocks
in rcuwait_wait_event(). Note that this step requires either
a preemptible kernel or a fault-injection-style sleep at the
beginning of mutex_lock().
o synchronize_rcu_tasks_generic() resumes and invokes rcu_tasks_one_gp().
o rcu_tasks_one_gp() attempts to acquire ->tasks_gp_mutex, which
is still held by the newly spawned kthread's rcu_tasks_one_gp()
function. Deadlock.
Because the only reason for ->tasks_gp_mutex is to handle pre-kthread
synchronous grace periods, this commit avoids this deadlock by having
rcu_tasks_one_gp() momentarily release ->tasks_gp_mutex while invoking
rcuwait_wait_event(). This allows the call to rcu_tasks_one_gp() from
synchronize_rcu_tasks_generic() to proceed.
Note that it is not necessary to release the mutex anywhere else in
rcu_tasks_one_gp() because rcuwait_wait_event() is the only function
that can block indefinitely.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Roy Hopkins <rhopkins@suse.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Roy Hopkins <rhopkins@suse.de>
Currently, cblist_init_generic() holds a raw spinlock when invoking
INIT_WORK(). This fails in kernels built with CONFIG_DEBUG_OBJECTS=y
due to memory allocation being forbidden while holding a raw spinlock.
But the only reason for holding the raw spinlock is to synchronize
with early boot calls to call_rcu_tasks(), call_rcu_tasks_rude(), and,
last but not least, call_rcu_tasks_trace(). These calls also invoke
cblist_init_generic() in order to support early boot queueing of
callbacks.
Except that there are no early boot calls to any of these three
functions, and the BPF guys confirm that they have no plans to add any
such calls.
This commit therefore removes the synchronization and adds a
WARN_ON_ONCE() to catch the case of now-prohibited early boot RCU Tasks
callback queueing.
If early boot queueing is needed, an "initialized" flag may be added to
the rcu_tasks structure. Then queueing a callback before this flag is set
would initialize the callback list (if needed) and queue the callback.
The decision as to where to queue the callback given the possibility of
non-zero boot CPUs is left as an exercise for the reader.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit causes RCU Tasks Trace to output the CPU time consumed by
its grace-period kthread. The CPU time is whatever is in the designated
task's current->stime field, and thus is controlled by whatever CPU-time
accounting scheme is in effect.
This output appears in microseconds as follows on the console:
rcu_scale: Grace-period kthread CPU time: 42367.037
[ paulmck: Apply Willy Tarreau feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the ability to output the CPU time consumed by the
grace-period kthread for the RCU variant under test. The CPU time is
whatever is in the designated task's current->stime field, and thus is
controlled by whatever CPU-time accounting scheme is in effect.
This output appears in microseconds as follows on the console:
rcu_scale: Grace-period kthread CPU time: 42367.037
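One way to produce a value of this shape is sketched below. It assumes
the underlying CPU time is available in nanoseconds. The function name
is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Render a CPU time given in nanoseconds (an assumption of this sketch)
 * as microseconds with three decimal places, using integer arithmetic
 * only.  Returns the number of characters snprintf() wanted to write.
 */
int format_cpu_time_us(char *buf, size_t size, unsigned long long ns)
{
	assert(size > 0);
	return snprintf(buf, size, "%llu.%03llu", ns / 1000, ns % 1000);
}
```

Splitting the value into quotient and remainder avoids floating point,
which is generally unavailable in kernel code.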
[ paulmck: Apply feedback from Stephen Rothwell and kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
The various RCU Tasks flavors now do lazy grace periods when there are
only asynchronous grace period requests. By default, the system will let
250 milliseconds elapse after the first call_rcu_tasks*() callback is
queued before starting a grace period. In contrast, synchronous grace
period requests such as synchronize_rcu_tasks*() will start a grace
period immediately.
However, invoking one of the call_rcu_tasks*() functions in a too-tight
loop can result in a callback flood, which in turn can exhaust memory
if grace periods are delayed for too long.
This commit therefore sets a limit: when any CPU's callback list
expands to contain rcupdate.rcu_task_lazy_lim callbacks (defaulting to
32, set to -1 to disable), the grace-period kthread will be awakened,
thus cancelling any ongoing laziness and getting out in front of the
potential callback flood.
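The limit check can be sketched as a small predicate. This is a
simplified illustration with a hypothetical function name, and it
assumes the check is made right after a callback is added to the list.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether queueing a callback should wake the grace-period
 * kthread early.  The check fires once, at the moment the list reaches
 * the limit.  A negative lazy_lim never matches, which gives the
 * "-1 to disable" behavior.
 */
bool flood_needs_wakeup(long n_cbs, long lazy_lim)
{
	assert(n_cbs >= 0);
	return n_cbs == lazy_lim;
}
```

Comparing for equality rather than "greater than or equal" keeps a
sustained flood from issuing a wakeup on every subsequent enqueue.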
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds kernel boot parameters for callback laziness, allowing
the RCU Tasks flavors to be individually adjusted.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The performance requirements on RCU Tasks, and in particular on RCU
Tasks Trace, have evolved over time as the workloads have evolved.
The current implementation is designed to provide low grace-period
latencies, and also to accommodate short-duration floods of callbacks.
However, current workloads can also provide a constant background
callback-queuing rate of a few hundred call_rcu_tasks_trace() invocations
per second. This results in continuous back-to-back RCU Tasks Trace
grace periods, which in turn can consume the better part of 10% of a CPU.
One could take the attitude that there are several tens of other CPUs on
the systems running such workloads, but energy efficiency is a thing.
On these systems, although asynchronous grace-period requests happen
every few milliseconds, synchronous grace-period requests are quite rare.
This commit therefore arranges for grace periods to be initiated
immediately in response to calls to synchronize_rcu_tasks*() and
also to calls to synchronize_rcu_mult() that are passed one of the
call_rcu_tasks*() functions. These are recognized by the tell-tale
wakeme_after_rcu callback function.
In other cases, callbacks are gathered up for up to about 250 milliseconds
before a grace period is initiated. This results in more than an order of
magnitude reduction in RCU Tasks Trace grace periods, with corresponding
reduction in consumption of CPU time.
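The policy described above might be modeled in user-space C as shown
below. This is only a sketch under stated assumptions: model_wakeme() is
a hypothetical stand-in for wakeme_after_rcu(), MODEL_LAZY_DELAY_MS
approximates the "about 250 milliseconds" figure, and the convention of
passing -1 for "no lazy timer pending" is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_cb_t)(void *arg);

/* Hypothetical stand-in for the kernel's wakeme_after_rcu() callback. */
static void model_wakeme(void *arg) { (void)arg; }

/* Hypothetical ordinary (asynchronous) callback. */
static void model_async_cb(void *arg) { (void)arg; }

/* Approximate laziness window taken from the text above. */
#define MODEL_LAZY_DELAY_MS 250

/*
 * Return how long, in milliseconds, the next grace period may be
 * deferred after queuing a callback with the given function.
 * pending_ms is the time left on an already-running lazy timer,
 * or -1 if no timer is pending (illustrative convention).
 */
static long model_gp_delay_ms(model_cb_t func, long pending_ms)
{
	assert(func != NULL);
	if (func == model_wakeme)
		return 0;		/* Synchronous waiter: start now. */
	if (pending_ms >= 0)
		return pending_ms;	/* Keep the existing deadline. */
	return MODEL_LAZY_DELAY_MS;	/* First lazy callback: arm timer. */
}
```

The key design point is that a synchronous request is recognized purely
by its callback function, so no extra flag needs to be passed down from
the synchronize_rcu_tasks*() callers.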
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently
fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However,
the queue_work_on() function's silent fallback mechanism relies on that
CPU having been online at some time in the past. When queue_work_on()
is passed a CPU that has never been online, workqueue lockups ensue,
which can be bad for your kernel's general health and well-being.
This commit therefore checks whether a given CPU has ever been online,
and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
queue_work_on(). Why not simply omit the queue_work_on() call entirely?
Because this function is flooding callback-invocation notifications
to all CPUs, and must deal with possibilities that include a sparse
cpu_possible_mask.
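The substitution described above can be illustrated with a small
user-space C model. The names below (MODEL_NR_CPUS, MODEL_CPU_UNBOUND,
model_beenonline[], model_pick_work_cpu()) are hypothetical and merely
mimic the roles of WORK_CPU_UNBOUND and the ->beenonline field; this is
a sketch of the decision, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical sentinel playing the role of WORK_CPU_UNBOUND. */
#define MODEL_CPU_UNBOUND MODEL_NR_CPUS

/* Hypothetical per-CPU "has ever been online" flags. */
static bool model_beenonline[MODEL_NR_CPUS];

/*
 * Choose the CPU to hand to the work-queuing call.  A CPU that has
 * never been online is replaced by the unbound sentinel, so the
 * notification is still queued rather than being dropped.
 */
static int model_pick_work_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return model_beenonline[cpu] ? cpu : MODEL_CPU_UNBOUND;
}
```

Note that the work is still queued in both cases, which matches the
reasoning above: the flood of notifications must reach every callback
list even when cpu_possible_mask is sparse.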
This commit also moves the setting of the rcu_data structure's
->beenonline field to rcu_cpu_starting(), which executes on the
incoming CPU before that CPU has ever enabled interrupts. This ensures
that the required workqueues are present. In addition, because the
incoming CPU has not yet enabled its interrupts, there cannot yet have
been any softirq handlers running on this CPU, which means that the
WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot
have triggered yet.
Fixes: d363f833c6 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit uses rtp->name instead of __func__ and outputs the value
of rcu_task_cb_adjust, thus reducing console-log output.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The call to synchronize_srcu() from rcu_tasks_postscan() can be stalled
by a task getting stuck in do_exit() between that function's calls to
exit_tasks_rcu_start() and exit_tasks_rcu_finish(). To ease diagnosis
of this situation, print a stall warning message every rcu_task_stall_info
period when rcu_tasks_postscan() is stalled.
[ paulmck: Adjust to handle CONFIG_SMP=n. ]
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/rcu/20230111212736.GA1062057@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The tasks_rcu_exit_srcu variable is used only by kernels built
with CONFIG_TASKS_RCU=y, but is defined for all kernels with
CONFIG_TASKS_RCU_GENERIC=y. Therefore, in kernels built with
CONFIG_TASKS_RCU_GENERIC=y but CONFIG_TASKS_RCU=n, this gives
a "defined but not used" warning.
This commit therefore moves this variable under CONFIG_TASKS_RCU.
Link: https://lore.kernel.org/oe-kbuild-all/202303191536.XzMSyzTl-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_tasks_need_gpcb() function determines whether or not: (1) There are
callbacks needing another grace period, (2) There are callbacks ready
to be invoked, and (3) It would be a good time to shrink back down to a
single-CPU callback list. This third case is interesting because some
other CPU might be adding new callbacks, which might suddenly make this
a very bad time to be shrinking.
This is currently handled by requiring call_rcu_tasks_generic() to
enqueue callbacks under the protection of rcu_read_lock() and requiring
rcu_tasks_need_gpcb() to wait for an RCU grace period to elapse before
finalizing the transition. This works well in practice.
Unfortunately, the current code assumes that a grace period whose end is
detected by the poll_state_synchronize_rcu() in the second "if" condition
actually ended before the earlier code counted the callbacks queued on
CPUs other than CPU 0 (local variable "ncbsnz"). Given the current code,
it is possible that a long-delayed call_rcu_tasks_generic() invocation
will queue a callback on a non-zero CPU after these CPUs have had their
callbacks counted and zero has been stored to ncbsnz. Such a callback
would trigger the WARN_ON_ONCE() in the second "if" statement.
To see this, consider the following sequence of events:
o CPU 0 invokes rcu_tasks_one_gp(), and counts fewer than
rcu_task_collapse_lim callbacks. It sees at least one
callback queued on some other CPU, thus setting ncbsnz
to a non-zero value.
o CPU 1 invokes call_rcu_tasks_generic() and loads 42 from
->percpu_enqueue_lim. It therefore decides to enqueue its
callback onto CPU 1's callback list, but is delayed.
o CPU 0 sees that rcu_task_cb_adjust is non-zero and that the number
of callbacks does not exceed rcu_task_collapse_lim. It therefore
checks percpu_enqueue_lim, and sees that its value is greater
than the value one. CPU 0 therefore starts the shift back
to a single callback list. It sets ->percpu_enqueue_lim to 1,
but CPU 1 has already read the old value of 42. It also gets
a grace-period state value from get_state_synchronize_rcu().
o CPU 0 sees that ncbsnz is non-zero in its second "if" statement,
so it declines to finalize the shrink operation.
o CPU 0 again invokes rcu_tasks_one_gp(), and counts fewer than
rcu_task_collapse_lim callbacks. It also sees that there are
no callbacks queued on any other CPU, and thus sets ncbsnz to zero.
o CPU 1 resumes execution and enqueues its callback onto its own
list. This invalidates the value of ncbsnz.
o CPU 0 sees that rcu_task_cb_adjust is non-zero and that the number
of callbacks does not exceed rcu_task_collapse_lim. It therefore
checks percpu_enqueue_lim, but sees that its value is already
unity. It therefore does not get a new grace-period state value.
o CPU 0 sees that rcu_task_cb_adjust is non-zero, ncbsnz is zero,
and that poll_state_synchronize_rcu() says that the grace period
has completed. It therefore finalizes the shrink operation,
setting ->percpu_dequeue_lim to the value one.
o CPU 0 does a debug check, scanning the other CPUs' callback lists.
It sees that CPU 1's list has a callback, so it (rightly)
triggers the WARN_ON_ONCE(). After all, the new value of
->percpu_dequeue_lim says to not bother looking at CPU 1's
callback list, which means that this callback will never be
invoked. This can result in hangs and maybe even OOMs.
Based on long experience with rcutorture, this is an extremely
low-probability race condition, but it really can happen, especially in
preemptible kernels or within guest OSes.
This commit therefore checks for completion of the grace period
before counting callbacks. With this change, in the above failure
scenario CPU 0 would know not to prematurely end the shrink operation
because the grace period would not have completed before the count
operation started.
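The effect of the reordering can be seen in a deliberately simplified,
single-threaded C model of the final pass in the scenario above. All
names here are hypothetical. The model assumes that the shrink has
already been started, that the grace period is "done" exactly when the
delayed enqueuer has left its read-side critical section, and that the
delayed enqueue lands between the callback count and the finalization
check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state for one non-zero CPU (CPU 1). */
struct model_shrink {
	long cbs_cpu1;		/* Callbacks queued on CPU 1. */
	bool reader_active;	/* Delayed enqueuer still in its reader. */
	int dequeue_lim;	/* Models ->percpu_dequeue_lim. */
};

typedef void (*model_hook_t)(struct model_shrink *m);

/* Grace period is done once the pre-existing reader has exited. */
static bool model_gp_done(const struct model_shrink *m)
{
	return !m->reader_active;
}

/* CPU 1 resumes: enqueue the callback, then leave the reader. */
static void model_delayed_enqueue(struct model_shrink *m)
{
	m->cbs_cpu1++;
	m->reader_active = false;
}

/* One pass; poll_first selects the fixed ordering. */
static void model_need_gpcb(struct model_shrink *m, bool poll_first,
			    model_hook_t hook)
{
	bool gpdone = false;
	long ncbsnz;

	assert(m->dequeue_lim >= 1);
	if (poll_first)
		gpdone = model_gp_done(m);	/* Fixed: poll, then count. */
	ncbsnz = m->cbs_cpu1;			/* Count non-zero-CPU callbacks. */
	if (hook != NULL)
		hook(m);			/* Racing enqueue lands here. */
	if (!poll_first)
		gpdone = model_gp_done(m);	/* Old: count, then poll. */
	if (!ncbsnz && gpdone)
		m->dequeue_lim = 1;		/* Finalize the shrink. */
}

/* True if a callback sits on a list that will no longer be scanned. */
static bool model_stranded(const struct model_shrink *m)
{
	return m->dequeue_lim == 1 && m->cbs_cpu1 > 0;
}
```

With the old ordering, the model finalizes the shrink and strands CPU 1's
callback. With the poll moved ahead of the count, the same interleaving
leaves the dequeue limit alone, and a later pass counts the new callback
and again declines to finalize.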
[ paulmck: Adjust grace-period end rather than adding RCU reader. ]
[ paulmck: Avoid spurious WARN_ON_ONCE() with ->percpu_dequeue_lim check. ]
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>