This patch adds scx_flatcg example scheduler which implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a
single layer by compounding the active weight share at each level.
This flattening of hierarchy can bring a substantial performance gain when
the cgroup hierarchy is nested multiple levels. in a simple benchmark using
wrk[8] on apache serving a CGI script calculating sha1sum of a small file,
it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two
apache instances competing with 2:1 weight ratio nested four level deep.
However, the gain comes at the cost of not being able to properly handle
thundering herd of cgroups. For example, if many cgroups which are nested
behind a low priority parent cgroup wake up around the same time, they may
be able to consume more CPU cycles than they are entitled to. In many use
cases, this isn't a real concern especially given the performance gain.
Also, there are ways to mitigate the problem further by e.g. introducing an
extra scheduling layer on cgroup delegation boundaries.
v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of
SCX_OPS_KNOB_CGROUP_WEIGHT.
v4: - Revert reference counted kptr for cgv_node as the change caused easily
reproducible stalls.
v3: - Updated to reflect the core API changes including ops.init/exit_task()
and direct dispatch from ops.select_cpu(). Fixes and improvements
including additional statistics.
- Use reference counted kptr for cgv_node instead of xchg'ing against
stash location.
- Dropped '-p' option.
v2: - Use SCX_BUG[_ON]() to simplify error handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
This patch adds a new example scheduler, scx_central, which demonstrates
central scheduling where one CPU is responsible for making all scheduling
decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
scheduling decisions for all CPUs in the system, queues tasks on the
appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
turn preempt the central CPU when it needs tasks to run.
Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.
v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
buffer if too many are dispatched to the fallback DSQ.
- Use the new SCX_KICK_IDLE to wake up non-central CPUs.
- Dropped '-p' option.
v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
to simplify error handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Add two simple example BPF schedulers - simple and qmap.
* simple: In terms of scheduling, it behaves identical to not having any
operation implemented at all. The two operations it implements are only to
improve visibility and exit handling. On certain homogeneous
configurations, this actually can perform pretty well.
* qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
on BPF maps for scheduling. While not very practical, this is useful as a
simple example and will be used to demonstrate different features.
v7: - Compat helpers stripped out in prepartion of upstreaming as the
upstreamed patchset will be the baselinfe. Utility macros that can be
used to implement compat features are kept.
- Explicitly disable map autoattach on struct_ops to avoid trying to
attach twice while maintaining compatbility with older libbpf.
v6: - Common header files reorganized and cleaned up. Compat helpers are
added to demonstrate how schedulers can maintain backward
compatibility with older kernels while making use of newly added
features.
- simple_select_cpu() added to keep track of the number of local
dispatches. This is needed because the default ops.select_cpu()
implementation is updated to dispatch directly and won't call
ops.enqueue().
- Updated to reflect the sched_ext API changes. Switching all tasks is
the default behavior now and scx_qmap supports partial switching when
`-p` is specified.
- tools/sched_ext/Kconfig dropped. This will be included in the doc
instead.
v5: - Improve Makefile. Build artifects are now collected into a separate
dir which change be changed. Install and help targets are added and
clean actually cleans everything.
- MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR()
and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss.
- Add scx_common.h which provides common utilities to user code such as
SCX_BUG[_ON]() and RESIZE_ARRAY().
- Use SCX_BUG[_ON]() to simplify error handling.
v4: - Dropped _example prefix from scheduler names.
v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
to ease later additions. Comment updates.
- Added declarations for BPF inline iterators. In the future, hopefully,
these will be consolidated into a generic BPF header so that they
don't need to be replicated here.
v2: - Updated with the generic BPF cpumask helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>