Pull scheduler updates from Ingo Molnar:
- MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.
As perf and the scheduler are getting bigger and more complex,
document the status quo of current responsibilities and interests,
and spread the review pain^H^H^H^H fun via an increase in the Cc:
linecount generated by scripts/get_maintainer.pl. :-)
- Add another series of patches that brings the -rt (PREEMPT_RT) tree
closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
into a new CONFIG_PREEMPTION category that will allow the eventual
introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
to go though.
- Extend the CPU cgroup controller with uclamp.min and uclamp.max to
allow the finer shaping of CPU bandwidth usage.
- Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
- Improve the behavior of high CPU count, high thread count
applications running under cpu.cfs_quota_us constraints.
- Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
- Improve the NUMA locality of housekeeping CPU allocation for CPU isolation.
- Fix deadline scheduler bandwidth calculations and logic when cpusets
rebuilds the topology, or when it gets deadline-throttled while it's
being offlined.
- Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
setscheduler() system calls without creating global serialization.
Add new synchronization between cpuset topology-changing events and
the deadline acceptance tests in setscheduler(), which were broken
before.
- Rework the active_mm state machine to be less confusing and more
optimal.
- Rework (simplify) the pick_next_task() slowpath.
- Improve load-balancing on AMD EPYC systems.
- ... and misc cleanups, smaller fixes and improvements - please see
the Git log for more details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
sched/psi: Correct overly pessimistic size calculation
sched/fair: Speed-up energy-aware wake-ups
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
sched/uclamp: Update CPU's refcount on TG's clamp changes
sched/uclamp: Use TG's clamps to restrict TASK's clamps
sched/uclamp: Propagate system defaults to the root group
sched/uclamp: Propagate parent clamps
sched/uclamp: Extend CPU's cgroup controller
sched/topology: Improve load balancing on AMD EPYC systems
arch, ia64: Make NUMA select SMP
sched, perf: MAINTAINERS update, add submaintainers and reviewers
sched/fair: Use rq_lock/unlock in online_fair_sched_group
cpufreq: schedutil: fix equation in comment
sched: Rework pick_next_task() slow-path
sched: Allow put_prev_task() to drop rq->lock
sched/fair: Expose newidle_balance()
sched: Add task_struct pointer to sched_class::set_curr_task
sched: Rework CPU hotplug task selection
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Fix kerneldoc comment for ia64_set_curr_task
...
The bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.
When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.
This eventually reaches the point where RCU enters an emergency
mode, in which rcu_do_batch() does not return until the whole queue is flushed.
Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.
The goal of this patch is to avoid infamous messages like the following:
"need_resched set for > 51999388 ns (52 ticks) without schedule"
We dynamically adjust the number of elements we process per batch:
instead of the existing 10/INFINITE choices, we use a floor of ~1% of
the current entries. If that number is above 1000, we switch to a
time-based limit of 3 msec per batch, adjustable with
/sys/module/rcutree/parameters/rcu_resched_ns.
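For illustration, here is a minimal standalone C model of the limit
selection just described. The 1% floor, the 1000-callback threshold, and
the 3 msec budget are taken from the text above; the names and structure
are illustrative only and do not reproduce the actual rcu_do_batch() code.

#define OLD_BLIMIT      10                      /* historical fixed per-pass limit */
#define RESCHED_NS      (3 * 1000 * 1000)       /* 3 msec budget (cf. rcu_resched_ns) */

struct batch_limit {
        long count;     /* maximum callbacks to invoke in this pass */
        long time_ns;   /* 0: no time budget; otherwise nanoseconds allowed */
};

/* Pick a batch limit based on the number of currently queued callbacks. */
static struct batch_limit pick_batch_limit(long queued)
{
        struct batch_limit bl = { .count = OLD_BLIMIT, .time_ns = 0 };

        /* Floor of roughly 1% of the current entries. */
        if (queued / 100 > bl.count)
                bl.count = queued / 100;

        /* Large batches (above 1000 here) are bounded by time, not count. */
        if (bl.count > 1000)
                bl.time_ns = RESCHED_NS;

        return bl;
}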
Signed-off-by: Eric Dumazet <edumazet@google.com>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
When under overload conditions, __call_rcu_nocb_wake() will wake the
no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
are no ready-to-invoke callbacks, but only after a timer delay. If the
no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
from __call_rcu_nocb_wake() is redundant. This commit therefore makes
__call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
->nocb_bypass_timer is pending. This requires adding a bit of ordering
of timer actions.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, __call_rcu_nocb_wake() advances callbacks each time that it
detects excessive numbers of callbacks, though only if it succeeds in
conditionally acquiring its leaf rcu_node structure's ->lock. Despite
the conditional acquisition of ->lock, this does increase contention.
This commit therefore avoids advancing callbacks unless there are
callbacks in ->cblist whose grace period has completed and advancing
has not yet been done during this jiffy.
Note that this decision does not take the presence of new callbacks
into account. That is because on this code path, there will always be
at least one new callback, namely the one we just enqueued.
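As a sketch of the gating condition, with illustrative field names
standing in for the per-CPU state mentioned above:

#include <stdbool.h>

struct adv_state {
        bool done_cbs_queued;           /* callbacks whose grace period has ended */
        unsigned long last_adv_jiffy;   /* jiffy of the most recent advancement */
};

/* Advance only when it can help, and at most once per jiffy. */
static bool worth_advancing(const struct adv_state *st, unsigned long now)
{
        return st->done_cbs_queued && st->last_adv_jiffy != now;
}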
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, nocb_cb_wait() advances callbacks on each pass through its
loop, though only if it succeeds in conditionally acquiring its leaf
rcu_node structure's ->lock. Despite the conditional acquisition of
->lock, this does increase contention. This commit therefore avoids
advancing callbacks unless there are callbacks in ->cblist whose grace
period has completed.
Note that nocb_cb_wait() doesn't worry about callbacks that have not
yet been assigned a grace period. The idea is that the only reason for
nocb_cb_wait() to advance callbacks is to allow it to continue invoking
callbacks. Time will tell whether this is the correct choice.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The rcutree_migrate_callbacks() function invokes rcu_advance_cbs() on both
the offlined CPU's ->cblist and that of the surviving CPU, then merges
them. However, after the merge, any of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number. This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.
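The effect can be modeled with a toy callback list in which a zero
grace-period number means "not yet assigned"; the helper below is a
deliberately simplified stand-in for rcu_advance_cbs(), not the real thing:

/* Toy callback: gp_seq == 0 means no grace-period number assigned yet. */
struct toy_cb {
        unsigned long gp_seq;
        struct toy_cb *next;
};

/*
 * Label every still-unassigned callback with the current grace period.
 * Running this once more after the two lists have been merged is exactly
 * the extra step added by this commit.
 */
static void toy_advance(struct toy_cb *head, unsigned long cur_gp_seq)
{
        for (; head; head = head->next)
                if (!head->gp_seq)
                        head->gp_seq = cur_gp_seq;
}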
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
When callbacks are in full flow, the common case is waiting for a
grace period, and this grace period will normally take a few jiffies to
complete. It therefore isn't all that helpful for __call_rcu_nocb_wake()
to do a synchronous wakeup in this case. This commit therefore turns this
into a timer-based deferred wakeup of the no-CBs grace-period kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit causes locking, sleeping, and callback state to be printed
for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
rcutorture to complain.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.
This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
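A rough standalone model of the enqueue and flush policy described above;
HZ and qhimark are shown with representative values, and none of the names
match the kernel's internals:

#include <stdbool.h>

#define HZ                      1000
#define NOBYPASS_LIM_PER_JIFFY  (16 * 1000 / HZ)  /* ~16 direct enqueues/ms */
#define QHIMARK                 10000             /* representative qhimark */

struct bypass_state {
        long direct_enq_this_jiffy;     /* direct ->cblist enqueues this jiffy */
        long bypass_len;                /* callbacks sitting on ->nocb_bypass */
        unsigned long last_flush;       /* jiffy of the most recent flush */
};

/* Decide whether a new callback goes onto ->cblist or ->nocb_bypass. */
static bool enqueue_direct(struct bypass_state *st)
{
        if (st->direct_enq_this_jiffy < NOBYPASS_LIM_PER_JIFFY) {
                st->direct_enq_this_jiffy++;
                return true;            /* under budget: use ->cblist */
        }
        return false;                   /* over budget: use ->nocb_bypass */
}

/* Flush policy applied by the CPU during its own call_rcu() invocations. */
static bool should_flush(const struct bypass_state *st, unsigned long now)
{
        return now != st->last_flush || st->bypass_len > QHIMARK;
}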
[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel. This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.
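The shape of the change, sketched in userspace C11 with stdatomic standing
in for the kernel's atomic_long_t and CONFIG_RCU_NOCB_CPU playing the role
of the Kconfig symbol:

#include <stdatomic.h>

/*
 * Concurrent manipulation needs an atomic counter; the single-writer
 * case keeps the cheaper plain long.
 */
struct segcblist_len_sketch {
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long len;        /* ->len may be updated by no-CBs kthreads */
#else
        long len;               /* ->len only ever touched by the owning CPU */
#endif
};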
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
When there are excessive numbers of callbacks, and when either the
corresponding no-CBs callback kthread is asleep or there are no more
ready-to-invoke callbacks, and when at least one callback is pending,
__call_rcu_nocb_wake() will advance the callbacks, but refrain from
awakening the corresponding no-CBs grace-period kthread. However,
because rcu_advance_cbs_nowake() is used, it is possible (if a bit
unlikely) that the needed advancement could not happen due to a grace
period not being in progress. Plus there will always be at least one
pending callback due to one having just now been enqueued.
This commit therefore attempts to advance callbacks and awakens the
no-CBs grace-period kthread when there are excessive numbers of callbacks
posted and when the no-CBs callback kthread is not in a position to do
anything helpful.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The sleep/wakeup of the no-CBs grace-period kthreads is synchronized
using the ->nocb_lock of the first CPU corresponding to that kthread.
This commit provides a separate ->nocb_gp_lock for this purpose, thus
reducing contention on ->nocb_lock.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node
->lock to advance callbacks when done invoking the previous batch.
It does this while holding ->nocb_lock, which means that contention on
the leaf rcu_node ->lock visits itself on the ->nocb_lock. This commit
therefore makes this lock acquisition conditional, forgoing callback
advancement when the leaf rcu_node ->lock is not immediately available.
(In this case, the no-CBs grace-period kthread will eventually do any
needed callback advancement.)
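A userspace model of the "advance only if the lock is free" pattern, with
pthread_mutex_trylock() standing in for the kernel's raw spinlock trylock:

#include <pthread.h>
#include <stdbool.h>

/*
 * Try to advance callbacks, but skip the work entirely on contention;
 * the no-CBs grace-period kthread will catch up later.
 */
static bool maybe_advance(pthread_mutex_t *node_lock,
                          void (*advance)(void *), void *arg)
{
        if (pthread_mutex_trylock(node_lock) != 0)
                return false;           /* contended: don't pile onto ->lock */
        advance(arg);                   /* safe to advance under the lock */
        pthread_mutex_unlock(node_lock);
        return true;
}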
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
check to see if it is possible to advance callbacks without potentially
needing to awaken the grace-period kthread. Given that the no-awaken
check can be done locklessly, this commit reverses the order, so that
rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
state before conditionally acquiring that lock, thus reducing the number
of needless acquisitions of the leaf rcu_node structure's ->lock.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, when the square root of the number of CPUs is rounded down
by int_sqrt(), this round-down is applied to the number of callback
kthreads per grace-period kthread. This makes almost no difference
for large systems, but results in oddities such as three no-CBs
grace-period kthreads for a five-CPU system, which is a bit excessive.
This commit therefore causes the round-down to apply to the number of
no-CBs grace-period kthreads, so that systems with from four to eight
CPUs have only two no-CBs grace period kthreads.
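The arithmetic can be checked with a few lines of C. Here isqrt() stands
in for int_sqrt(), the "old" column reproduces the five-CPU example above
(group size rounded down, hence more groups), and the exact formulas
should be treated as illustrative:

#include <stdio.h>

/* Integer square root by simple search; stands in for int_sqrt(). */
static long isqrt(long n)
{
        long r = 0;

        while ((r + 1) * (r + 1) <= n)
                r++;
        return r;
}

int main(void)
{
        for (long cpus = 4; cpus <= 8; cpus++) {
                long group = isqrt(cpus);                 /* old: round down group size */
                long old_gp = (cpus + group - 1) / group; /* -> GP kthreads, rounded up */
                long new_gp = isqrt(cpus);                /* new: round down kthread count */

                printf("%ld CPUs: old %ld GP kthreads, new %ld\n",
                       cpus, old_gp, new_gp);
        }
        return 0;
}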
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
A given rcu_data structure's ->nocb_lock can be acquired very frequently
by the corresponding CPU and occasionally by the corresponding no-CBs
grace-period and callbacks kthreads. In particular, these two kthreads
will have frequent gaps between ->nocb_lock acquisitions that are roughly
a grace period in duration. This means that any excessive ->nocb_lock
contention will be due to the CPU's acquisitions, and this in turn
enables a very naive contention-avoidance strategy to be quite effective.
This commit therefore modifies rcu_nocb_lock() to first
attempt a raw_spin_trylock(), and to atomically increment a
separate ->nocb_lock_contended across a raw_spin_lock(). This new
->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when
interrupts are enabled, with a spin-wait for contending acquisitions
to complete, thus allowing the kthreads a chance to acquire the lock.
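A userspace model of this contention-avoidance scheme; the field names
mirror the text above rather than the kernel's structure layout:

#include <pthread.h>
#include <stdatomic.h>

struct nocb_lock_model {
        pthread_mutex_t lock;           /* stands in for ->nocb_lock */
        atomic_int contended;           /* stands in for ->nocb_lock_contended */
};

static void nocb_lock(struct nocb_lock_model *l)
{
        if (pthread_mutex_trylock(&l->lock) == 0)
                return;                         /* common case: uncontended */
        atomic_fetch_add(&l->contended, 1);     /* advertise the contention */
        pthread_mutex_lock(&l->lock);
        atomic_fetch_sub(&l->contended, 1);
}

/*
 * With interrupts enabled, the CPU briefly spins when it sees a kthread
 * contending, giving that kthread a chance to take the lock.
 */
static void yield_to_contenders(struct nocb_lock_model *l)
{
        while (atomic_load(&l->contended) != 0)
                ;       /* spin-wait for contending acquisitions to complete */
}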
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, the code provides an extra wakeup for the no-CBs grace-period
kthread if one of its CPUs is generating excessive numbers of callbacks.
But satisfying though it is to wake something up when things are going
south, unless the thing being awakened can actually help solve the
problem, that extra wakeup does nothing but consume additional CPU time,
which is exactly what you don't want during a call_rcu() flood.
This commit therefore avoids doing anything if the corresponding
no-CBs callback kthread is going full tilt. Otherwise, if advancing
callbacks immediately might help and if the leaf rcu_node structure's
lock is immediately available, this commit invokes a new variant of
rcu_advance_cbs() that advances callbacks only if doing so won't require
awakening the grace-period kthread (not to be confused with any of the
no-CBs grace-period kthreads).
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
It might be hard to imagine having more than two billion callbacks
queued on a single CPU's ->cblist, but someone will do it sometime.
This commit therefore makes __call_rcu_nocb_wake() handle this situation
by upgrading local variable "len" from "int" to "long".
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, wake_nocb_gp_defer() simply stores whatever waketype was
passed in, which can result in an RCU_NOCB_WAKE_FORCE being downgraded
to RCU_NOCB_WAKE, which could in turn delay callback processing.
This commit therefore adds a check so that wake_nocb_gp_defer() only
updates ->nocb_defer_wakeup when the update increases the forcefulness,
thus avoiding downgrades.
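A sketch of the "never downgrade" rule; the enumerators are illustrative
counterparts of the RCU_NOCB_WAKE values, not the kernel's definitions:

enum wake_strength { WAKE_NOT, WAKE_NORMAL, WAKE_FORCE };

static void record_deferred_wakeup(enum wake_strength *deferred,
                                   enum wake_strength waketype)
{
        if (*deferred < waketype)       /* only ever increase the forcefulness */
                *deferred = waketype;
}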
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The __call_rcu_nocb_wake() function and its predecessors set
->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2
for forced reawakenings. The former can result in a too-quick reawakening
when there are many callbacks ready to invoke and the latter prevents a
second reawakening. This commit therefore sets ->qlen_last_fqs_check
to the current number of callbacks in both cases. While in the area,
this commit also moves both assignments under ->nocb_lock.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Historically, no-CBs CPUs allowed the scheduler-clock tick to be
unconditionally disabled on any transition to idle or nohz_full userspace
execution (see the rcu_needs_cpu() implementations). Unfortunately,
the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs
use ->cblist, which might make users of battery-powered devices rather
unhappy. This commit therefore adds explicit rcu_segcblist_is_offloaded()
checks to return to the historical energy-efficient semantics.
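The added check amounts to something along these lines (kernel-style
fragment, surrounding context elided; the exact placement within the
rcu_needs_cpu() implementations differs):

/*
 * An offloaded CPU never invokes callbacks itself, so its queued
 * callbacks must not keep the scheduler-clock tick running.
 */
if (rcu_segcblist_is_offloaded(&rdp->cblist))
        return 0;       /* the tick may be stopped on this no-CBs CPU */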
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Some compilers complain that wait_gp_seq might be used uninitialized
in nocb_gp_wait(). This cannot actually happen because when wait_gp_seq
is uninitialized, needwait_gp must be false, which prevents wait_gp_seq
from being used. But this analysis is apparently beyond some compilers,
so this commit adds a bogus initialization of wait_gp_seq for the sole
purpose of suppressing the false-positive warning.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even
in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded.
Given that rcu_pending() is on a fastpath, it makes sense to check for
CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded().
This commit therefore makes this change.
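Kernel-style sketch of the short-circuit; IS_ENABLED() is resolved at
compile time, so CONFIG_RCU_NOCB_CPU=n kernels never even call the helper
on this fastpath (surrounding rcu_pending() logic elided):

if (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) ||
    !rcu_segcblist_is_offloaded(&rdp->cblist)) {
        /* ... checks that apply only to non-offloaded CPUs ... */
}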
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it
needs to know whether the current CPU is a no-CBs CPU. Given that it is
not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this repeated runtime invocation wastes CPU. This commit
therefore creates a const on-stack variable to allow this check to be
done only once per rcu_core() invocation.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time
it needs to know whether the current CPU is a no-CBs CPU. Given that it
is not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this per-callback invocation wastes CPU. This commit therefore
creates a const on-stack variable to allow this check to be done only
once per rcu_do_batch() invocation.
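Kernel-style sketch of the hoisting; because offload status cannot change
after boot, the check is made once and its result reused for every
callback in the batch:

/* Done once per rcu_do_batch() invocation rather than once per callback. */
const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
                       rcu_segcblist_is_offloaded(&rdp->cblist);

/*
 * Later, in the per-callback loop and on the exit path, the cached
 * "offloaded" value is tested instead of re-invoking the helper.
 */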
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
disable the ->cblist fields of offline CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently the RCU callbacks for no-CBs CPUs are queued on a series of
ad-hoc linked lists, which means that these callbacks cannot benefit
from "drive-by" grace periods, thus suffering needless delays prior
to invocation. In addition, the no-CBs grace-period kthreads first
wait for callbacks to appear and later wait for a new grace period,
which means that callbacks appearing during a grace-period wait can
be delayed. These delays increase memory footprint, and could even
result in an out-of-memory condition.
This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
rcu_segcblist structure that is already used by non-no-CBs CPUs. It also
restructures the no-CBs grace-period kthread to be checking for incoming
callbacks while waiting for grace periods. Also, instead of waiting
for a new grace period, it waits for the closest grace period that will
cause some of the callbacks to be safe to invoke. All of these changes
reduce callback latency and thus the number of outstanding callbacks,
in turn reducing the probability of an out-of-memory condition.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
As a first step towards making no-CBs CPUs use the ->cblist, this commit
leaves the ->cblist enabled for these CPUs. The main reason to make
no-CBs CPUs use ->cblist is to take advantage of callback numbering,
which will reduce the effects of missed grace periods which in turn will
reduce forward-progress problems for no-CBs CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, rcu_segcblist_empty() assumes that the callback list is not
being changed by other CPUs, but upcoming changes will require it to
operate locklessly. This commit therefore adds the needed READ_ONCE()
call, along with the WRITE_ONCE() calls when updating the callback list's
->head field.
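A sketch consistent with the description; the real rcu_segcblist_empty()
is a one-liner of this shape, and every site that updates ->head pairs it
with WRITE_ONCE():

/* Lockless emptiness check: tolerate concurrent updates to ->head. */
static bool rcu_segcblist_empty_sketch(struct rcu_segcblist *rsclp)
{
        return !READ_ONCE(rsclp->head);
}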
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, rcu_segcblist_restempty() assumes that the callback list
is not being changed by other CPUs, but upcoming changes will require
it to operate locklessly. This commit therefore adds the needed
READ_ONCE() calls, along with the WRITE_ONCE() calls when updating
the callback list.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The idea behind the checks for extended quiescent states at the end of
__call_rcu_nocb() is to handle cases where call_rcu() is invoked directly
from within an extended quiescent state, for example, from the idle loop.
However, this will result in a timer-mediated deferred wakeup, which
will cause the needed wakeup to happen within a jiffy or thereabouts.
There should be no forward-progress concerns, and if there are, the proper
response is to exit the extended quiescent state while executing the
endless blast of call_rcu() invocations, for example, using RCU_NONIDLE().
Given the more realistic case of an isolated call_rcu() invocation, there
should be no problem.
This commit therefore removes the checks for invoking call_rcu() within
an extended quiescent state on no-CBs CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
In theory, a timer is used to defer wakeups of no-CBs grace-period
kthreads when the wakeup cannot be done safely directly from
call_rcu(). In practice, the one-jiffy delay is not always consistent
with timely callback invocation under heavy call_rcu() loads. Therefore,
there are a number of checks for a pending deferred wakeup, including
from the scheduling-clock interrupt. Unfortunately, this check follows
the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs.
This commit therefore moves the check for the pending deferred no-CB
wakeup to precede the rcu_nohz_full_cpu() early exit.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Because rcutree_migrate_callbacks() is invoked infrequently and because
an exact snapshot of the grace-period state might save some callbacks a
second trip through a grace period, this function has used the root
rcu_node structure. However, this safe-second-trip optimization
happens only if rcutree_migrate_callbacks() races with grace-period
initialization, so it is not worth the added mental load. This commit
therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node
structures, as is done elsewhere.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit is a preparatory patch for offloaded callbacks using the
same ->cblist structure used by non-offloaded callbacks. It therefore
adds rcu_segcblist_is_offloaded() calls where they will be needed when
!rcu_segcblist_is_enabled() no longer flags the offloaded case. It also
adds checks in rcu_do_batch() to ensure that there are no missed checks:
Currently, it should not be possible for offloaded execution to reach
rcu_do_batch(), though this will change later in this series.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
RCU callback processing currently uses rcu_is_nocb_cpu() to determine
whether or not the current CPU's callbacks are to be offloaded.
This works, but it is not so good for cache locality. Plus use of
->cblist for offloaded callbacks will greatly increase the frequency
of these checks. This commit therefore adds a ->offloaded flag to the
rcu_segcblist structure to provide a more flexible and cache-friendly
means of checking for callback offloading.
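The rough shape of the change, as a kernel-header-style sketch (the real
structure has additional fields, and the layout here is abbreviated):

struct rcu_segcblist {
        /* ... list head, per-segment tails, ->len, ... */
        u8 offloaded;   /* set at boot for no-CBs CPUs, never changes later */
};

static bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp)
{
        return rsclp->offloaded;
}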
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but
forward-progress considerations would require that this pointer be both
NULL and non-NULL, which, absent a quantum-computer port of the Linux
kernel, simply won't happen. This commit therefore creates a separate
->enabled flag to replace the current NULL checks.
[ paulmck: Add include files per 0day test robot and -next. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit causes the no-CBs grace-period/callback hierarchy to be
printed to the console when the dump_tree kernel boot parameter is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit changes the name of the rcu_nocb_leader_stride kernel
boot parameter to rcu_nocb_gp_stride in order to account for the new
distinction between callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake()
event, which never was documented and is now misleading. This commit
therefore changes "FollowerSleep" to "CBSleep", documents this, and
updates the documentation for "Sleep" as well.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit renames rdp_leader to rdp_gp in order to account for the
new distinction between callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads. While in the area, it also
updates local variables.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuo callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>