Commit Graph

728 Commits

Author SHA1 Message Date
Chen Ridong
381b53c3b5 cgroup/cpuset: rename functions shared between v1 and v2
Some functions name declared in cpuset-internel.h are generic. To avoid
confilicting with other variables for the same name, rename these
functions with cpuset_/cpuset1_ prefix to make them unique to cpuset.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
b0ced9d378 cgroup/cpuset: move v1 interfaces to cpuset-v1.c
Move legacy cpuset controller interfaces files and corresponding code
into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and
'cpuset_common_seq_show' are also used for v1, so declare them in
cpuset-internal.h.

'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used
cpuset-v1.c now, make it static.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
be126b5b1b cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
The validate_change_legacy functions is used for v1, move it to
cpuset-v1.c. And two micro 'cpuset_for_each_child' and
'cpuset_for_each_descendant_pre' are common for v1 and v2, move them to
cpuset-internal.h.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
23ca5237e3 cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
There are some differents about hotplug update between cpuset v1 and
cpuset v2. Move the legacy code to cpuset-v1.c.

'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset
v1 and cpuset v2, declare them in cpuset-internal.h.

The change from original code is that use callback_lock helpers to get
callback_lock lock/unlock.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
530020f28f cgroup/cpuset: add callback_lock helper
To modify cpuset, both cpuset_mutex and callback_lock are needed. Add
helpers for cpuset-v1 to get callback_lock.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
90eec9548d cgroup/cpuset: move memory_spread to cpuset-v1.c
'memory_spread' is only set in cpuset v1. move corresponding code into
cpuset-v1.c.

Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are
exposed to cpuset.c.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
047b830974 cgroup/cpuset: move relax_domain_level to cpuset-v1.c
Setting domain level is not supported at cpuset v2, so move corresponding
code into cpuset-v1.c.

The 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting
domain level, move them to cpuset-v1.c. Currently, expose to cpuset.c.
After cpuset legacy interface files are move to cpuset-v1.c, they can
be static. The 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c.

The change from original code is that using 'cpuset_lock' and
'cpuset_unlock' functions to lock or unlock cpuset_mutex.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:15 -10:00
Chen Ridong
49434094ef cgroup/cpuset: move memory_pressure to cpuset-v1.c
Collection of memory_pressure can be enabled by writing 1 to the cpuset
file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore,
move the corresponding code to cpuset-v1.c.

Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called
at cpuset.c, so expose them to cpuset.c.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:15 -10:00
Chen Ridong
619a33efa0 cgroup/cpuset: move common code to cpuset-internal.h
Move some declarations that will be used for cpuset v1 and v2,
including 'cpuset struct', 'cpuset_flagbits_t', cpuset_filetype_t,etc.
No logical change.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:15 -10:00
Chen Ridong
71e934a808 cgroup/cpuset: introduce cpuset-v1.c
This patch introduces the cgroup/cpuset-v1.c source file which will be
used for all legacy (cgroup v1) cpuset cgroup code. It also introduces
cgroup/cpuset-internal.h to keep declarations shared between
cgroup/cpuset.c and cpuset/cpuset-v1.c.

As of now, let's compile it if CONFIG_CPUSET is set. Later on it can be
switched to use a separate config option, so that the legacy code won't be
compiled if not required.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:15 -10:00
Waiman Long
c188f33c86 cgroup/cpuset: Account for boot time isolated CPUs
With the "isolcpus" boot command line parameter, we are able to
create isolated CPUs at boot time. These isolated CPUs aren't fully
accounted for in the cpuset code. For instance, the root cgroup's
"cpuset.cpus.isolated" control file does not include the boot time
isolated CPUs. Fix that by looking for pre-isolated CPUs at init time.

The prstate_housekeeping_conflict() function does check the
HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it
can only be used in isolated partition. Given the fact that we are going
to make housekeeping cpumasks dynamic, the current check may not be right
anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against
it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 09:23:39 -10:00
Chen Ridong
3c2acae888 cgroup/cpuset: remove use_parent_ecpus of cpuset
use_parent_ecpus is used to track whether the children are using the
parent's effective_cpus. When a parent's effective_cpus is changed
due to changes in a child partition's effective_xcpus, any child
using parent'effective_cpus must call update_cpumasks_hier. However,
if a child is not a valid partition, it is sufficient to determine
whether to call update_cpumasks_hier based on whether the child's
effective_cpus is going to change. To make the code more succinct,
it is suggested to remove use_parent_ecpus.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 08:51:48 -10:00
Chen Ridong
9414f68d45 cgroup/cpuset: remove fetch_xcpus
Both fetch_xcpus and user_xcpus functions are used to retrieve the value
of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the
implicit value used as exclusive in a local partition. I can not imagine
a scenario where effective_xcpus is not empty when exclusive_cpus is
empty. Therefore, I suggest removing the fetch_xcpus function.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 08:51:34 -10:00
Chen Ridong
e55f45b4ba cgroup/cpuset: Correct invalid remote parition prs
When enable a remote partition, I found that:

cd /sys/fs/cgroup/
mkdir test
mkdir test/test1
echo +cpuset > cgroup.subtree_control
echo +cpuset >  test/cgroup.subtree_control
echo 3 > test/test1/cpuset.cpus
echo root > test/test1/cpuset.cpus.partition
cat test/test1/cpuset.cpus.partition
root invalid (Parent is not a partition root)

The parent of a remote partition could not be a root. This is due to the
emtpy effective_xcpus. It would be better to prompt the message "invalid
cpu list in cpuset.cpus.exclusive".

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 08:51:16 -10:00
Chen Ridong
d1a92d2d6c cgroup: update some statememt about delegation
The comment in cgroup_file_write is missing some interfaces, such as
'cgroup.threads'. All delegatable files are listed in
'/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write.
Besides, add a statement that files outside the namespace shouldn't be
visible from inside the delegated namespace.

tj: Reflowed text for consistency.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-19 12:16:17 -10:00
Waiman Long
9b103943ab cgroup: Fix incorrect WARN_ON_ONCE() in css_release_work_fn()
It turns out that the WARN_ON_ONCE() call in css_release_work_fn
introduced by commit ab03125268 ("cgroup: Show # of subsystem CSSes
in cgroup.stat") is incorrect. Although css->nr_descendants must be
0 when a css is released and ready to be freed, the corresponding
cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem is activated
and deactivated multiple times with one or more of its previous
activation leaving behind dying csses.

Fix the incorrect warning by removing the cgrp->nr_dying_subsys check.

Fixes: ab03125268 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Closes: https://lore.kernel.org/cgroups/6f301773-2fce-4602-a391-8af7ef00b2fb@redhat.com/T/#t
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-09 06:35:29 -10:00
Waiman Long
99570300d3 cgroup/cpuset: Check for partition roots with overlapping CPUs
With the previous commit that eliminates the overlapping partition
root corner cases in the hotplug code, the partition roots passed down
to generate_sched_domains() should not have overlapping CPUs. Enable
overlapping cpuset check for v2 and warn if that happens.

This patch also has the benefit of increasing test coverage of the new
Union-Find cpuset merging code to cgroup v2.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:57:38 -10:00
Tejun Heo
bc3c27516f Merge branch 'cgroup/for-6.11-fixes' into cgroup/for-6.12
cgroup/for-6.12 is about to receive updates that are dependent on changes
from both for-6.11-fixes and for-6.12. Pull in for-6.11-fixes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:56:49 -10:00
Waiman Long
ff0ce721ec cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug
It was found that some hotplug operations may cause multiple
rebuild_sched_domains_locked() calls. Some of those intermediate calls
may use cpuset states not in the final correct form leading to incorrect
sched domain setting.

Fix this problem by using the existing force_rebuild flag to inhibit
immediate rebuild_sched_domains_locked() calls if set and only doing
one final call at the end. Also renaming the force_rebuild flag to
force_sd_rebuild to make its meaning for clear.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:54:25 -10:00
Waiman Long
311a1bdc44 cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
Commit e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for
v2") adds a user writable cpuset.cpus.exclusive file for setting
exclusive CPUs to be used for the creation of partitions. Since then
effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive
setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend
only on cpuset.cpus.exclusive.  When it is not set, effective_xcpus
will be set according to the cpuset.cpus value when the cpuset becomes
a valid partition root.

When cpuset.cpus is being cleared by the user, effective_xcpus should
only be cleared when cpuset.cpus.exclusive is not set. However, that
is not currently the case.

  # cd /sys/fs/cgroup/
  # mkdir test
  # echo +cpuset > cgroup.subtree_control
  # cd test
  # echo 3 > cpuset.cpus.exclusive
  # cat cpuset.cpus.exclusive.effective
  3
  # echo > cpuset.cpus
  # cat cpuset.cpus.exclusive.effective // was cleared

Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is
not set.

Fixes: e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Cc: stable@vger.kernel.org # v6.7+
Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:52:40 -10:00
Chen Ridong
959ab6350a cgroup/cpuset: fix panic caused by partcmd_update
We find a bug as below:
BUG: unable to handle page fault for address: 00000003
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 358 Comm: bash Tainted: G        W I        6.6.0-10893-g60d6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
RIP: 0010:partition_sched_domains_locked+0x483/0x600
Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
Call Trace:
 <TASK>
 ? show_regs+0x8c/0xa0
 ? __die_body+0x23/0xa0
 ? __die+0x3a/0x50
 ? page_fault_oops+0x1d2/0x5c0
 ? partition_sched_domains_locked+0x483/0x600
 ? search_module_extables+0x2a/0xb0
 ? search_exception_tables+0x67/0x90
 ? kernelmode_fixup_or_oops+0x144/0x1b0
 ? __bad_area_nosemaphore+0x211/0x360
 ? up_read+0x3b/0x50
 ? bad_area_nosemaphore+0x1a/0x30
 ? exc_page_fault+0x890/0xd90
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? asm_exc_page_fault+0x26/0x30
 ? partition_sched_domains_locked+0x483/0x600
 ? partition_sched_domains_locked+0xf0/0x600
 rebuild_sched_domains_locked+0x806/0xdc0
 update_partition_sd_lb+0x118/0x130
 cpuset_write_resmask+0xffc/0x1420
 cgroup_file_write+0xb2/0x290
 kernfs_fop_write_iter+0x194/0x290
 new_sync_write+0xeb/0x160
 vfs_write+0x16f/0x1d0
 ksys_write+0x81/0x180
 __x64_sys_write+0x21/0x30
 x64_sys_call+0x2f25/0x4630
 do_syscall_64+0x44/0xb0
 entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f44a553c887

It can be reproduced with cammands:
cd /sys/fs/cgroup/
mkdir test
cd test/
echo +cpuset > ../cgroup.subtree_control
echo root > cpuset.cpus.partition
cat /sys/fs/cgroup/cpuset.cpus.effective
0-3
echo 0-3 > cpuset.cpus // taking away all cpus from root

This issue is caused by the incorrect rebuilding of scheduling domains.
In this scenario, test/cpuset.cpus.partition should be an invalid root
and should not trigger the rebuilding of scheduling domains. When calling
update_parent_effective_cpumask with partcmd_update, if newmask is not
null, it should recheck newmask whether there are cpus is available
for parect/cs that has tasks.

Fixes: 0c7f293efc ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:52:40 -10:00
Xiu Jianfeng
4980f71202 cgroup/pids: Remove unreachable paths of pids_{can,cancel}_fork
According to the implementation of cgroup_css_set_fork(), it will fail
if cset cannot be found and the can_fork/cancel_fork methods will not
be called in this case, which means that the argument 'cset' for these
methods must not be NULL, so remove the unrechable paths in them.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 10:32:16 -10:00
Waiman Long
ab03125268 cgroup: Show # of subsystem CSSes in cgroup.stat
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.

The /proc/cgroups file shows the number of cgroups for each of the
subsystems.  With cgroup v1, the number of CSSes is the same as the
number of cgroups.  That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.

So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.

As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems that
have been bound to cgroup v2. The number includes CSSes in the current
cgroup as well as in all the descendants underneath it.  This will help
us pinpoint which subsystems are responsible for the increasing number
of dying (nr_dying_descendants) cgroups.

The CSSes dying counts are stored in the cgroup structure itself
instead of inside the CSS as suggested by Johannes. This will allow
us to accurately track dying counts of cgroup subsystems that have
recently been disabled in a cgroup. It is now possible that a zero
subsystem number is coupled with a non-zero dying subsystem number.

The cgroup-v2.rst file is updated to discuss this new behavior.

With this patch applied, a sample output from root cgroup.stat file
was shown below.

	nr_descendants 56
	nr_subsys_cpuset 1
	nr_subsys_cpu 43
	nr_subsys_io 43
	nr_subsys_memory 56
	nr_subsys_perf_event 57
	nr_subsys_hugetlb 1
	nr_subsys_pids 56
	nr_subsys_rdma 1
	nr_subsys_misc 1
	nr_dying_descendants 30
	nr_dying_subsys_cpuset 0
	nr_dying_subsys_cpu 0
	nr_dying_subsys_io 0
	nr_dying_subsys_memory 30
	nr_dying_subsys_perf_event 0
	nr_dying_subsys_hugetlb 0
	nr_dying_subsys_pids 0
	nr_dying_subsys_rdma 0
	nr_dying_subsys_misc 0

Another sample output from system.slice/cgroup.stat was:

	nr_descendants 34
	nr_subsys_cpuset 0
	nr_subsys_cpu 32
	nr_subsys_io 32
	nr_subsys_memory 34
	nr_subsys_perf_event 35
	nr_subsys_hugetlb 0
	nr_subsys_pids 34
	nr_subsys_rdma 0
	nr_subsys_misc 0
	nr_dying_descendants 30
	nr_dying_subsys_cpuset 0
	nr_dying_subsys_cpu 0
	nr_dying_subsys_io 0
	nr_dying_subsys_memory 30
	nr_dying_subsys_perf_event 0
	nr_dying_subsys_hugetlb 0
	nr_dying_subsys_pids 0
	nr_dying_subsys_rdma 0
	nr_dying_subsys_misc 0

Note that 'debug' controller wasn't used to provide this information because
the controller is not recommended in productions kernels, also many of them
won't enable CONFIG_CGROUP_DEBUG by default.

Similar information could be retrieved with debuggers like drgn but that's
also not always available (e.g. lockdown) and the additional cost of runtime
tracking here is deemed marginal.

tj: Added Michal's paragraphs on why this is not added the debug controller
    to the commit message.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-31 07:00:02 -10:00
Xavier
8a895c2e6a cpuset: use Union-Find to optimize the merging of cpumasks
The process of constructing scheduling domains
 involves multiple loops and repeated evaluations, leading to numerous
 redundant and ineffective assessments that impact code efficiency.

Here, we use union-find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.

Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 13:04:50 -10:00
Chen Ridong
4a711dd910 cgroup/cpuset: add decrease attach_in_progress helpers
There are several functions to decrease attach_in_progress, and they
will wake up cpuset_attach_wq when attach_in_progress is zero. So,
add a helper to make it concise.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 12:38:14 -10:00
Xiu Jianfeng
c149c4a48b cgroup/cpuset: Remove cpuset_slab_spread_rotor
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 12:29:08 -10:00
Xiu Jianfeng
d72a00a848 cgroup/pids: Avoid spurious event notification
Currently when a process in a group forks and fails due to it's
parent's max restriction, all the cgroups from 'pids_forking' to root
will generate event notifications but only the cgroups from
'pids_over_limit' to root will increase the counter of PIDCG_MAX.

Consider this scenario: there are 4 groups A, B, C,and D, the
relationships are as follows, and user is watching on C.pids.events.

root->A->B->C->D

When a process in D forks and fails due to B.max restriction, the
user will get a spurious event notification because when he wakes up
and reads C.pids.events, he will find that the content has not changed.

To address this issue, only the cgroups from 'pids_over_limit' to root
will have their PIDCG_MAX counters increased and event notifications
generated.

Fixes: 385a635cac ("cgroup/pids: Make event counters hierarchical")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 12:13:19 -10:00
Chen Ridong
d632604757 cgroup/cpuset: remove child_ecpus_count
The child_ecpus_count variable was previously used to update
sibling cpumask when parent's effective_cpus is updated. However, it became
obsolete after commit e2ffe502ba ("cgroup/cpuset: Add
cpuset.cpus.exclusive for v2"). It should be removed.

tj: Restored {} for style consistency.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 09:45:26 -10:00
Tejun Heo
9283ff5be1 Merge branch 'for-6.10-fixes' into for-6.11 2024-07-14 18:04:03 -10:00
Xiu Jianfeng
6a26f9c689 cgroup/misc: Introduce misc.events.local
Currently the event counting provided by misc.events is hierarchical,
it's not practical if user is only concerned with events of a
specified cgroup. Therefore, introduce misc.events.local collect events
specific to the given cgroup.

This is analogous to memory.events.local and pids.events.local.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-12 06:45:23 -10:00
Chen Ridong
b824766504 cgroup/rstat: add force idle show helper
In the function cgroup_base_stat_cputime_show, there are five
instances of #ifdef, which makes the code not concise.
To address this, add the function cgroup_force_idle_show
to make the code more succinct.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-05 08:32:08 -10:00
Waiman Long
57b56d1680 cgroup: Protect css->cgroup write under css_set_lock
The writing of css->cgroup associated with the cgroup root in
rebind_subsystems() is currently protected only by cgroup_mutex.
However, the reading of css->cgroup in both proc_cpuset_show() and
proc_cgroup_show() is protected just by css_set_lock. That makes the
readers susceptible to racing problems like data tearing or caching.
It is also a problem that can be reported by KCSAN.

This can be fixed by using READ_ONCE() and WRITE_ONCE() to access
css->cgroup. Alternatively, the writing of css->cgroup can be moved
under css_set_lock as well which is done by this patch.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-03 08:59:06 -10:00
Xiu Jianfeng
1028f391d5 cgroup/misc: Introduce misc.peak
Introduce misc.peak to record the historical maximum usage of the
resource, as in some scenarios the value of misc.max could be
adjusted based on the peak usage of the resource.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-03 08:08:43 -10:00
Chen Ridong
1be59c97c8 cgroup/cpuset: Prevent UAF in proc_cpuset_show()
An UAF can happen when /proc/cpuset is read as reported in [1].

This can be reproduced by the following methods:
1.add an mdelay(1000) before acquiring the cgroup_lock In the
 cgroup_path_ns function.
2.$cat /proc/<pid>/cpuset   repeatly.
3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
$umount /sys/fs/cgroup/cpuset/   repeatly.

The race that cause this bug can be shown as below:

(umount)		|	(cat /proc/<pid>/cpuset)
css_release		|	proc_cpuset_show
css_release_work_fn	|	css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn	|	cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root	|	mutex_lock(&cgroup_mutex);
rebind_subsystems	|
cgroup_free_root 	|
			|	// cgrp was freed, UAF
			|	cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgrp
will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
&cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, there are cases
where the cgroup_root allocated by setting up the root for cgroup v1
is cached. This could lead to a Use-After-Free (UAF) if it is
subsequently freed. The descendant cgroups of cgroup v1 can only be
freed after the css is released. However, the css of the root will never
be released, yet the cgroup_root should be freed when it is unmounted.
This means that obtaining a reference to the css of the root does
not guarantee that css.cgrp->root will not be freed.

Fix this problem by using rcu_read_lock in proc_cpuset_show().
As cgroup_root is kfree_rcu after commit d23b5c5777
("cgroup: Make operations on the cgroup root_list RCU safe"),
css->cgroup won't be freed during the critical section.
To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to
replace task_get_css with task_css.

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-28 07:10:31 -10:00
Waiman Long
737bb142a0 cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus
The "cpuset.cpus.exclusive.effective" value is currently limited to a
subset of its "cpuset.cpus". This makes the exclusive CPUs distribution
hierarchy subsumed within the larger "cpuset.cpus" hierarchy. We have to
decide on what CPUs are used locally and what CPUs can be passed down as
exclusive CPUs down the hierarchy and combine them into "cpuset.cpus".

The advantage of the current scheme is to have only one hierarchy to
worry about. However, it make it harder to use as all the "cpuset.cpus"
values have to be properly set along the way down to the designated remote
partition root. It also makes it more cumbersome to find out what CPUs
can be used locally.

Make creation of remote partition simpler by breaking the
dependency of "cpuset.cpus.exclusive" on "cpuset.cpus" and make
them independent entities. Now we have two separate hierarchies -
one for setting "cpuset.cpus.effective" and the other one for setting
"cpuset.cpus.exclusive.effective". We may not need to set "cpuset.cpus"
when we activate a partition root anymore.

Also update Documentation/admin-guide/cgroup-v2.rst and cpuset.c comment
to document this change.

Suggested-by: Petr Malat <oss@malat.biz>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 07:37:38 -10:00
Waiman Long
fe8cd2736e cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition
The CS_CPU_EXCLUSIVE flag is currently set whenever cpuset.cpus.exclusive
is set to make sure that the exclusivity test will be run to ensure its
exclusiveness. At the same time, this flag can be changed whenever the
partition root state is changed. For example, the CS_CPU_EXCLUSIVE flag
will be reset whenever a partition root becomes invalid. This makes
using CS_CPU_EXCLUSIVE to ensure exclusiveness a bit fragile.

The current scheme also makes setting up a cpuset.cpus.exclusive
hierarchy to enable remote partition harder as cpuset.cpus.exclusive
cannot overlap with any cpuset.cpus of sibling cpusets if their
cpuset.cpus.exclusive aren't set.

Solve these issues by deferring the setting of CS_CPU_EXCLUSIVE flag
until the cpuset become a valid partition root while adding new checks
in validate_change() to ensure that cpuset.cpus.exclusive of sibling
cpusets cannot overlap.

An additional check is also added to validate_change() to make sure that
cpuset.cpus of one cpuset cannot be a subset of cpuset.cpus.exclusive
of a sibling cpuset to avoid the problem that none of those CPUs will
be available when these exclusive CPUs are extracted out to a newly
enabled partition root. The Documentation/admin-guide/cgroup-v2.rst
file is updated to document the new constraints.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 07:37:37 -10:00
Waiman Long
ccac8e8de9 cgroup/cpuset: Fix remote root partition creation problem
Since commit 181c8e091a ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, doesn't
take into account this new behavior and hence will fail to create the
sched domain needed for a remote root (non-isolated) partition.

There are two issues related to remote partition support. First of
all, generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is non-zero. The
later condition isn't quite correct for remote partitions as nr_subparts
just shows the number of local child partitions underneath it. There
can be no local child partition under top_cpuset even if there are
remote partitions further down the hierarchy. Fix that by checking
for subpartitions_cpus which contains exclusive CPUs allocated to both
local and remote partitions.

Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough as remote partition does not need to
have a partition root parent. Fix this problem by breaking csa[] array
generation loop of generate_sched_domains() into v1 and v2 specific parts
and checking a cpuset's exclusive_cpus before skipping its subtree in
the v2 case.

Also simplify generate_sched_domains() for cgroup v2 as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes other than a different
way to create an isolated partition are supported.

Fixes: 181c8e091a ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 07:37:37 -10:00
Oleg Nesterov
6fe960147e cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()
cgroup_exit() needs to do this only if the exiting task is a leader and it
is not the last live thread.  The patch doesn't use delay_group_leader(),
atomic_read(signal->live) matches the code css_task_iter_advance() more.

cgroup_release() can now check list_empty(task->cg_list) before it takes
css_set_lock and calls ss_set_skip_task_iters().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 07:18:41 -10:00
Waiman Long
1805c1729f cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
If only isolated partitions are being created underneath the cgroup root,
there will only be one sched domain with top_cpuset.effective_cpus. We can
skip the unnecessary sched domains scanning code and save some cycles.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-03 09:47:11 -10:00
Xiu Jianfeng
018ee567de cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE
In the cpuset_css_online(), clearing the CS_SCHED_LOAD_BALANCE bit
of cs->flags is guarded by callback_lock and cpuset_mutex. There is
no problem with itself, because it is consistent with the description
of there two global lock at the beginning of this file. However, since
the operation of checking, setting and clearing the flag bit is atomic,
protection of callback_lock is unnecessary here, see CS_SPREAD_*. so
to make it more consistent with the other code, move the operation
outside the critical section of callback_lock.

No functional changes intended.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-01 07:06:17 -10:00
David Wang
a8d55ff5f3 kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files
Even though css_clear_dir would be called to cleanup
all existing cgroup files when css_populate_dir failed,
reclaiming newly created cgroup files before
css_populate_dir returns with failure makes code more
consistent.

Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:48:11 -10:00
Michal Koutný
3f26a885a0 cgroup/pids: Add pids.events.local
Hierarchical counting of events is not practical for watching when a
particular pids.max is being hit. Therefore introduce .local flavor of
events file (akin to memory controller) that collects only events
relevant to given cgroup.

The file is only added to the default hierarchy.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:45:10 -10:00
Michal Koutný
385a635cac cgroup/pids: Make event counters hierarchical
The pids.events file should honor the hierarchy, so make the events
propagate from their origin up to the root on the unified hierarchy. The
legacy behavior remains non-hierarchical.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:45:09 -10:00
Michal Koutný
73e75e6fc3 cgroup/pids: Separate semantics of pids.events related to pids.max
Currently, when pids.max limit is breached in the hierarchy, the event
is counted and reported in the cgroup where the forking task resides.

This decouples the limit and the notification caused by the limit making
it hard to detect when the actual limit was effected.

Redefine the pids.events:max as: the number of times the limit of the
cgroup was hit.

(Implementation differentiates also "forkfail" event but this is
currently not exposed as it would better fit into pids.stat. It also
differs from pids.events:max only when pids.max is configured on
non-leaf cgroups.)

Since it changes semantics of the original "max" event, introduce this
change only in the v2 API of the controller and add a cgroup2 mount
option to revert to the legacy behavior.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:45:09 -10:00
Xiu Jianfeng
0ac380020c cgroup/cpuset: Update comment on callback_lock
Since commit 51ffe41178 ("cpuset: convert away from cftype->read()"),
cpuset_common_file_read() has been renamed.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:30:38 -10:00
Xiu Jianfeng
d9fc6b4220 cgroup/cpuset: Remove unnecessary zeroing
The struct cpuset is kzalloc'd, all the members are zeroed already,
so don't need nodes_clear() here.

No functional changes intended.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26 08:28:30 -10:00
Linus Torvalds
8dde191aab Misc fixes:
- Fix a sched_balance_newidle setting bug
 
  - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst
 
  - Fix variable-shadowing build warning
 
  - Extend sched-domains debug output
 
  - Fix documentation
 
  - Fix comments
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZIbj4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hEng/+NlAh7mm4AWckVjUxqyUnJ/omaV9Fe5F+
 koiihntyvhk+4RR40XomXPq37Av3zPo1dnKI4fJ3yioMs1tB+8JD+nVo3DURLGT/
 4k+lYI+K6RXBzUTpzeYZWVfa+ddGwbRu1KA5joI7QvRfjil7QP5rC5AQbAj0AiVO
 Xvor0M9vEcfkqShTttx4h2u7WVR4zqVEhBxkWNMT6dMxN2HnKm4qcAiX39E8p+Vx
 maC2/iO+1rXORRbUh+KBHR40WAwe2CVvh5hCe1sl+/vGfCbAnMK1k+j85UdV1pFD
 aZ1jSBwIERnx9PdD5zK0GCRx9hmux8mkJCeBseZyK/XubYuVOLiwBxfYA/9C3i3O
 1mQizaFBD8zanEiWj10sOxbfry+XhLwcISIiWC+xLpxKb0MvDD1TIeZR1fJv3Oz7
 14iYhq2CuKhfntYmV6fYTzSzXL2s16dMYMH/7m7cLY0P/cJo2vw7GNxkwPeJsOVN
 uX6jnRde2Kp3q+Er3I2u1SGeAZ8fEzXr19MCWRA0qI+wvgYQkaTgoh9zO9AwRNoa
 9hS/jc6Gq+O5xBMMJIPZMfOVai9RhYlPmQavFCGJLd3EFoVi9jp9+/iXgtyARCZp
 rfXFV9Dd9GvpFRzNnsMrLiKswBzUop5+epHYKZhVHJKH7aiHMbGEFD6cgNlf8k9b
 GFda3ay4JHA=
 =2okO
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix a sched_balance_newidle setting bug

 - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst

 - Fix variable-shadowing build warning

 - Extend sched-domains debug output

 - Fix documentation

 - Fix comments

* tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()
  sched/fair: Remove stale FREQUENCY_UTIL comment
  sched/fair: Fix initial util_avg calculation
  docs: cgroup-v1: Clarify that domain levels are system-specific
  sched/debug: Dump domains' level
  sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
  arch/topology: Fix variable naming to avoid shadowing
2024-05-19 11:38:15 -07:00
Vitalii Bursov
a1fd0b9d75 sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:

  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.

Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
2024-05-17 09:48:24 +02:00
Jesper Dangaard Brouer
21c38a3bd4 cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
This closely resembles helpers added for the global cgroup_rstat_lock in
commit fc29e04ae1 ("cgroup/rstat: add cgroup_rstat_lock helpers and
tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.

Based on production workloads, we observe the fast-path "update" function
cgroup_rstat_updated() is invoked around 3 million times per sec, while the
"flush" function cgroup_rstat_flush_locked(), walking each possible CPU,
can see periodic spikes of 700 invocations/sec.

For this reason, the tracepoints are split into normal and fastpath
versions for this per-CPU lock. Making it feasible for production to
continuously monitor the non-fastpath tracepoint to detect lock contention
issues. The reason for monitoring is that lock disables IRQs which can
disturb e.g. softirq processing on the local CPUs involved. When the
global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex),
this per CPU lock becomes the next bottleneck that can introduce latency
variations.

A practical bpftrace script for monitoring contention latency:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
     @start[tid]=nsecs; @cnt[probe]=count()}
   tracepoint:cgroup:cgroup_rstat_cpu_locked {
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
     @cnt[probe]=count()}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-14 09:43:17 -10:00
Xiu Jianfeng
b7d56d953a cgroup/cpuset: Remove outdated comment in sched_partition_write()
The comment here is outdated and can cause confusion, from the code
perspective, there’s also no need for new comment, so just remove it.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-25 07:16:19 -10:00