linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-23 12:14:10 +08:00

Author	SHA1	Message	Date
Chen Ridong	381b53c3b5	cgroup/cpuset: rename functions shared between v1 and v2 Some functions name declared in cpuset-internel.h are generic. To avoid confilicting with other variables for the same name, rename these functions with cpuset_/cpuset1_ prefix to make them unique to cpuset. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	b0ced9d378	cgroup/cpuset: move v1 interfaces to cpuset-v1.c Move legacy cpuset controller interfaces files and corresponding code into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and 'cpuset_common_seq_show' are also used for v1, so declare them in cpuset-internal.h. 'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used cpuset-v1.c now, make it static. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	be126b5b1b	cgroup/cpuset: move validate_change_legacy to cpuset-v1.c The validate_change_legacy functions is used for v1, move it to cpuset-v1.c. And two micro 'cpuset_for_each_child' and 'cpuset_for_each_descendant_pre' are common for v1 and v2, move them to cpuset-internal.h. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	23ca5237e3	cgroup/cpuset: move legacy hotplug update to cpuset-v1.c There are some differents about hotplug update between cpuset v1 and cpuset v2. Move the legacy code to cpuset-v1.c. 'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset v1 and cpuset v2, declare them in cpuset-internal.h. The change from original code is that use callback_lock helpers to get callback_lock lock/unlock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	530020f28f	cgroup/cpuset: add callback_lock helper To modify cpuset, both cpuset_mutex and callback_lock are needed. Add helpers for cpuset-v1 to get callback_lock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	90eec9548d	cgroup/cpuset: move memory_spread to cpuset-v1.c 'memory_spread' is only set in cpuset v1. move corresponding code into cpuset-v1.c. Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are exposed to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:16 -10:00
Chen Ridong	047b830974	cgroup/cpuset: move relax_domain_level to cpuset-v1.c Setting domain level is not supported at cpuset v2, so move corresponding code into cpuset-v1.c. The 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting domain level, move them to cpuset-v1.c. Currently, expose to cpuset.c. After cpuset legacy interface files are move to cpuset-v1.c, they can be static. The 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c. The change from original code is that using 'cpuset_lock' and 'cpuset_unlock' functions to lock or unlock cpuset_mutex. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:15 -10:00
Chen Ridong	49434094ef	cgroup/cpuset: move memory_pressure to cpuset-v1.c Collection of memory_pressure can be enabled by writing 1 to the cpuset file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore, move the corresponding code to cpuset-v1.c. Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called at cpuset.c, so expose them to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:15 -10:00
Chen Ridong	619a33efa0	cgroup/cpuset: move common code to cpuset-internal.h Move some declarations that will be used for cpuset v1 and v2, including 'cpuset struct', 'cpuset_flagbits_t', cpuset_filetype_t,etc. No logical change. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:15 -10:00
Chen Ridong	71e934a808	cgroup/cpuset: introduce cpuset-v1.c This patch introduces the cgroup/cpuset-v1.c source file which will be used for all legacy (cgroup v1) cpuset cgroup code. It also introduces cgroup/cpuset-internal.h to keep declarations shared between cgroup/cpuset.c and cpuset/cpuset-v1.c. As of now, let's compile it if CONFIG_CPUSET is set. Later on it can be switched to use a separate config option, so that the legacy code won't be compiled if not required. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 10:00:15 -10:00
Waiman Long	c188f33c86	cgroup/cpuset: Account for boot time isolated CPUs With the "isolcpus" boot command line parameter, we are able to create isolated CPUs at boot time. These isolated CPUs aren't fully accounted for in the cpuset code. For instance, the root cgroup's "cpuset.cpus.isolated" control file does not include the boot time isolated CPUs. Fix that by looking for pre-isolated CPUs at init time. The prstate_housekeeping_conflict() function does check the HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it can only be used in isolated partition. Given the fact that we are going to make housekeeping cpumasks dynamic, the current check may not be right anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-30 09:23:39 -10:00
Chen Ridong	3c2acae888	cgroup/cpuset: remove use_parent_ecpus of cpuset use_parent_ecpus is used to track whether the children are using the parent's effective_cpus. When a parent's effective_cpus is changed due to changes in a child partition's effective_xcpus, any child using parent'effective_cpus must call update_cpumasks_hier. However, if a child is not a valid partition, it is sufficient to determine whether to call update_cpumasks_hier based on whether the child's effective_cpus is going to change. To make the code more succinct, it is suggested to remove use_parent_ecpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:48 -10:00
Chen Ridong	9414f68d45	cgroup/cpuset: remove fetch_xcpus Both fetch_xcpus and user_xcpus functions are used to retrieve the value of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the implicit value used as exclusive in a local partition. I can not imagine a scenario where effective_xcpus is not empty when exclusive_cpus is empty. Therefore, I suggest removing the fetch_xcpus function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:34 -10:00
Chen Ridong	e55f45b4ba	cgroup/cpuset: Correct invalid remote parition prs When enable a remote partition, I found that: cd /sys/fs/cgroup/ mkdir test mkdir test/test1 echo +cpuset > cgroup.subtree_control echo +cpuset > test/cgroup.subtree_control echo 3 > test/test1/cpuset.cpus echo root > test/test1/cpuset.cpus.partition cat test/test1/cpuset.cpus.partition root invalid (Parent is not a partition root) The parent of a remote partition could not be a root. This is due to the emtpy effective_xcpus. It would be better to prompt the message "invalid cpu list in cpuset.cpus.exclusive". Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:16 -10:00
Chen Ridong	d1a92d2d6c	cgroup: update some statememt about delegation The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write. Besides, add a statement that files outside the namespace shouldn't be visible from inside the delegated namespace. tj: Reflowed text for consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-19 12:16:17 -10:00
Waiman Long	9b103943ab	cgroup: Fix incorrect WARN_ON_ONCE() in css_release_work_fn() It turns out that the WARN_ON_ONCE() call in css_release_work_fn introduced by commit `ab03125268` ("cgroup: Show # of subsystem CSSes in cgroup.stat") is incorrect. Although css->nr_descendants must be 0 when a css is released and ready to be freed, the corresponding cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem is activated and deactivated multiple times with one or more of its previous activation leaving behind dying csses. Fix the incorrect warning by removing the cgrp->nr_dying_subsys check. Fixes: `ab03125268` ("cgroup: Show # of subsystem CSSes in cgroup.stat") Closes: https://lore.kernel.org/cgroups/6f301773-2fce-4602-a391-8af7ef00b2fb@redhat.com/T/#t Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-09 06:35:29 -10:00
Waiman Long	99570300d3	cgroup/cpuset: Check for partition roots with overlapping CPUs With the previous commit that eliminates the overlapping partition root corner cases in the hotplug code, the partition roots passed down to generate_sched_domains() should not have overlapping CPUs. Enable overlapping cpuset check for v2 and warn if that happens. This patch also has the benefit of increasing test coverage of the new Union-Find cpuset merging code to cgroup v2. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:57:38 -10:00
Tejun Heo	bc3c27516f	Merge branch 'cgroup/for-6.11-fixes' into cgroup/for-6.12 cgroup/for-6.12 is about to receive updates that are dependent on changes from both for-6.11-fixes and for-6.12. Pull in for-6.11-fixes. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:56:49 -10:00
Waiman Long	ff0ce721ec	cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug It was found that some hotplug operations may cause multiple rebuild_sched_domains_locked() calls. Some of those intermediate calls may use cpuset states not in the final correct form leading to incorrect sched domain setting. Fix this problem by using the existing force_rebuild flag to inhibit immediate rebuild_sched_domains_locked() calls if set and only doing one final call at the end. Also renaming the force_rebuild flag to force_sd_rebuild to make its meaning for clear. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:54:25 -10:00
Waiman Long	311a1bdc44	cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set Commit `e2ffe502ba` ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") adds a user writable cpuset.cpus.exclusive file for setting exclusive CPUs to be used for the creation of partitions. Since then effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend only on cpuset.cpus.exclusive. When it is not set, effective_xcpus will be set according to the cpuset.cpus value when the cpuset becomes a valid partition root. When cpuset.cpus is being cleared by the user, effective_xcpus should only be cleared when cpuset.cpus.exclusive is not set. However, that is not currently the case. # cd /sys/fs/cgroup/ # mkdir test # echo +cpuset > cgroup.subtree_control # cd test # echo 3 > cpuset.cpus.exclusive # cat cpuset.cpus.exclusive.effective 3 # echo > cpuset.cpus # cat cpuset.cpus.exclusive.effective // was cleared Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is not set. Fixes: `e2ffe502ba` ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Cc: stable@vger.kernel.org # v6.7+ Reported-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:52:40 -10:00
Chen Ridong	959ab6350a	cgroup/cpuset: fix panic caused by partcmd_update We find a bug as below: BUG: unable to handle page fault for address: 00000003 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 3 PID: 358 Comm: bash Tainted: G W I 6.6.0-10893-g60d6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4 RIP: 0010:partition_sched_domains_locked+0x483/0x600 Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9 RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202 RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80 RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000 R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002 R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0 Call Trace: <TASK> ? show_regs+0x8c/0xa0 ? __die_body+0x23/0xa0 ? __die+0x3a/0x50 ? page_fault_oops+0x1d2/0x5c0 ? partition_sched_domains_locked+0x483/0x600 ? search_module_extables+0x2a/0xb0 ? search_exception_tables+0x67/0x90 ? kernelmode_fixup_or_oops+0x144/0x1b0 ? __bad_area_nosemaphore+0x211/0x360 ? up_read+0x3b/0x50 ? bad_area_nosemaphore+0x1a/0x30 ? exc_page_fault+0x890/0xd90 ? __lock_acquire.constprop.0+0x24f/0x8d0 ? __lock_acquire.constprop.0+0x24f/0x8d0 ? asm_exc_page_fault+0x26/0x30 ? partition_sched_domains_locked+0x483/0x600 ? partition_sched_domains_locked+0xf0/0x600 rebuild_sched_domains_locked+0x806/0xdc0 update_partition_sd_lb+0x118/0x130 cpuset_write_resmask+0xffc/0x1420 cgroup_file_write+0xb2/0x290 kernfs_fop_write_iter+0x194/0x290 new_sync_write+0xeb/0x160 vfs_write+0x16f/0x1d0 ksys_write+0x81/0x180 __x64_sys_write+0x21/0x30 x64_sys_call+0x2f25/0x4630 do_syscall_64+0x44/0xb0 entry_SYSCALL_64_after_hwframe+0x78/0xe2 RIP: 0033:0x7f44a553c887 It can be reproduced with cammands: cd /sys/fs/cgroup/ mkdir test cd test/ echo +cpuset > ../cgroup.subtree_control echo root > cpuset.cpus.partition cat /sys/fs/cgroup/cpuset.cpus.effective 0-3 echo 0-3 > cpuset.cpus // taking away all cpus from root This issue is caused by the incorrect rebuilding of scheduling domains. In this scenario, test/cpuset.cpus.partition should be an invalid root and should not trigger the rebuilding of scheduling domains. When calling update_parent_effective_cpumask with partcmd_update, if newmask is not null, it should recheck newmask whether there are cpus is available for parect/cs that has tasks. Fixes: `0c7f293efc` ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:52:40 -10:00
Xiu Jianfeng	4980f71202	cgroup/pids: Remove unreachable paths of pids_{can,cancel}_fork According to the implementation of cgroup_css_set_fork(), it will fail if cset cannot be found and the can_fork/cancel_fork methods will not be called in this case, which means that the argument 'cset' for these methods must not be NULL, so remove the unrechable paths in them. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-05 10:32:16 -10:00
Waiman Long	ab03125268	cgroup: Show # of subsystem CSSes in cgroup.stat Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to help manage different structures in various cgroup subsystems by being an embedded element inside a larger structure like cpuset or mem_cgroup. The /proc/cgroups file shows the number of cgroups for each of the subsystems. With cgroup v1, the number of CSSes is the same as the number of cgroups. That is not the case anymore with cgroup v2. The /proc/cgroups file cannot show the actual number of CSSes for the subsystems that are bound to cgroup v2. So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup), we can't tell by looking at /proc/cgroups which cgroup subsystems may be responsible. As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical cgroup.stat file is now being extended to show the number of live and dying CSSes associated with all the non-inhibited cgroup subsystems that have been bound to cgroup v2. The number includes CSSes in the current cgroup as well as in all the descendants underneath it. This will help us pinpoint which subsystems are responsible for the increasing number of dying (nr_dying_descendants) cgroups. The CSSes dying counts are stored in the cgroup structure itself instead of inside the CSS as suggested by Johannes. This will allow us to accurately track dying counts of cgroup subsystems that have recently been disabled in a cgroup. It is now possible that a zero subsystem number is coupled with a non-zero dying subsystem number. The cgroup-v2.rst file is updated to discuss this new behavior. With this patch applied, a sample output from root cgroup.stat file was shown below. nr_descendants 56 nr_subsys_cpuset 1 nr_subsys_cpu 43 nr_subsys_io 43 nr_subsys_memory 56 nr_subsys_perf_event 57 nr_subsys_hugetlb 1 nr_subsys_pids 56 nr_subsys_rdma 1 nr_subsys_misc 1 nr_dying_descendants 30 nr_dying_subsys_cpuset 0 nr_dying_subsys_cpu 0 nr_dying_subsys_io 0 nr_dying_subsys_memory 30 nr_dying_subsys_perf_event 0 nr_dying_subsys_hugetlb 0 nr_dying_subsys_pids 0 nr_dying_subsys_rdma 0 nr_dying_subsys_misc 0 Another sample output from system.slice/cgroup.stat was: nr_descendants 34 nr_subsys_cpuset 0 nr_subsys_cpu 32 nr_subsys_io 32 nr_subsys_memory 34 nr_subsys_perf_event 35 nr_subsys_hugetlb 0 nr_subsys_pids 34 nr_subsys_rdma 0 nr_subsys_misc 0 nr_dying_descendants 30 nr_dying_subsys_cpuset 0 nr_dying_subsys_cpu 0 nr_dying_subsys_io 0 nr_dying_subsys_memory 30 nr_dying_subsys_perf_event 0 nr_dying_subsys_hugetlb 0 nr_dying_subsys_pids 0 nr_dying_subsys_rdma 0 nr_dying_subsys_misc 0 Note that 'debug' controller wasn't used to provide this information because the controller is not recommended in productions kernels, also many of them won't enable CONFIG_CGROUP_DEBUG by default. Similar information could be retrieved with debuggers like drgn but that's also not always available (e.g. lockdown) and the additional cost of runtime tracking here is deemed marginal. tj: Added Michal's paragraphs on why this is not added the debug controller to the commit message. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-31 07:00:02 -10:00
Xavier	8a895c2e6a	cpuset: use Union-Find to optimize the merging of cpumasks The process of constructing scheduling domains involves multiple loops and repeated evaluations, leading to numerous redundant and ineffective assessments that impact code efficiency. Here, we use union-find to optimize the merging of cpumasks. By employing path compression and union by rank, we effectively reduce the number of lookups and merge comparisons. Signed-off-by: Xavier <xavier_qy@163.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-30 13:04:50 -10:00
Chen Ridong	4a711dd910	cgroup/cpuset: add decrease attach_in_progress helpers There are several functions to decrease attach_in_progress, and they will wake up cpuset_attach_wq when attach_in_progress is zero. So, add a helper to make it concise. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-30 12:38:14 -10:00
Xiu Jianfeng	c149c4a48b	cgroup/cpuset: Remove cpuset_slab_spread_rotor Since the SLAB implementation was removed in v6.8, so the cpuset_slab_spread_rotor is no longer used and can be removed. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-30 12:29:08 -10:00
Xiu Jianfeng	d72a00a848	cgroup/pids: Avoid spurious event notification Currently when a process in a group forks and fails due to it's parent's max restriction, all the cgroups from 'pids_forking' to root will generate event notifications but only the cgroups from 'pids_over_limit' to root will increase the counter of PIDCG_MAX. Consider this scenario: there are 4 groups A, B, C,and D, the relationships are as follows, and user is watching on C.pids.events. root->A->B->C->D When a process in D forks and fails due to B.max restriction, the user will get a spurious event notification because when he wakes up and reads C.pids.events, he will find that the content has not changed. To address this issue, only the cgroups from 'pids_over_limit' to root will have their PIDCG_MAX counters increased and event notifications generated. Fixes: `385a635cac` ("cgroup/pids: Make event counters hierarchical") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-30 12:13:19 -10:00
Chen Ridong	d632604757	cgroup/cpuset: remove child_ecpus_count The child_ecpus_count variable was previously used to update sibling cpumask when parent's effective_cpus is updated. However, it became obsolete after commit `e2ffe502ba` ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2"). It should be removed. tj: Restored {} for style consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-30 09:45:26 -10:00
Tejun Heo	9283ff5be1	Merge branch 'for-6.10-fixes' into for-6.11	2024-07-14 18:04:03 -10:00
Xiu Jianfeng	6a26f9c689	cgroup/misc: Introduce misc.events.local Currently the event counting provided by misc.events is hierarchical, it's not practical if user is only concerned with events of a specified cgroup. Therefore, introduce misc.events.local collect events specific to the given cgroup. This is analogous to memory.events.local and pids.events.local. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-12 06:45:23 -10:00
Chen Ridong	b824766504	cgroup/rstat: add force idle show helper In the function cgroup_base_stat_cputime_show, there are five instances of #ifdef, which makes the code not concise. To address this, add the function cgroup_force_idle_show to make the code more succinct. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 08:32:08 -10:00
Waiman Long	57b56d1680	cgroup: Protect css->cgroup write under css_set_lock The writing of css->cgroup associated with the cgroup root in rebind_subsystems() is currently protected only by cgroup_mutex. However, the reading of css->cgroup in both proc_cpuset_show() and proc_cgroup_show() is protected just by css_set_lock. That makes the readers susceptible to racing problems like data tearing or caching. It is also a problem that can be reported by KCSAN. This can be fixed by using READ_ONCE() and WRITE_ONCE() to access css->cgroup. Alternatively, the writing of css->cgroup can be moved under css_set_lock as well which is done by this patch. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-03 08:59:06 -10:00
Xiu Jianfeng	1028f391d5	cgroup/misc: Introduce misc.peak Introduce misc.peak to record the historical maximum usage of the resource, as in some scenarios the value of misc.max could be adjusted based on the peak usage of the resource. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-03 08:08:43 -10:00
Chen Ridong	1be59c97c8	cgroup/cpuset: Prevent UAF in proc_cpuset_show() An UAF can happen when /proc/cpuset is read as reported in [1]. This can be reproduced by the following methods: 1.add an mdelay(1000) before acquiring the cgroup_lock In the cgroup_path_ns function. 2.$cat /proc/<pid>/cpuset repeatly. 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/ $umount /sys/fs/cgroup/cpuset/ repeatly. The race that cause this bug can be shown as below: (umount) \| (cat /proc/<pid>/cpuset) css_release \| proc_cpuset_show css_release_work_fn \| css = task_get_css(tsk, cpuset_cgrp_id); css_free_rwork_fn \| cgroup_path_ns(css->cgroup, ...); cgroup_destroy_root \| mutex_lock(&cgroup_mutex); rebind_subsystems \| cgroup_free_root \| \| // cgrp was freed, UAF \| cgroup_path_ns_locked(cgrp,..); When the cpuset is initialized, the root node top_cpuset.css.cgrp will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated &cgroup_root.cgrp. When the umount operation is executed, top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp. The problem is that when rebinding to cgrp_dfl_root, there are cases where the cgroup_root allocated by setting up the root for cgroup v1 is cached. This could lead to a Use-After-Free (UAF) if it is subsequently freed. The descendant cgroups of cgroup v1 can only be freed after the css is released. However, the css of the root will never be released, yet the cgroup_root should be freed when it is unmounted. This means that obtaining a reference to the css of the root does not guarantee that css.cgrp->root will not be freed. Fix this problem by using rcu_read_lock in proc_cpuset_show(). As cgroup_root is kfree_rcu after commit `d23b5c5777` ("cgroup: Make operations on the cgroup root_list RCU safe"), css->cgroup won't be freed during the critical section. To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to replace task_get_css with task_css. [1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd Fixes: `a79a908fd2` ("cgroup: introduce cgroup namespaces") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-28 07:10:31 -10:00
Waiman Long	737bb142a0	cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus The "cpuset.cpus.exclusive.effective" value is currently limited to a subset of its "cpuset.cpus". This makes the exclusive CPUs distribution hierarchy subsumed within the larger "cpuset.cpus" hierarchy. We have to decide on what CPUs are used locally and what CPUs can be passed down as exclusive CPUs down the hierarchy and combine them into "cpuset.cpus". The advantage of the current scheme is to have only one hierarchy to worry about. However, it make it harder to use as all the "cpuset.cpus" values have to be properly set along the way down to the designated remote partition root. It also makes it more cumbersome to find out what CPUs can be used locally. Make creation of remote partition simpler by breaking the dependency of "cpuset.cpus.exclusive" on "cpuset.cpus" and make them independent entities. Now we have two separate hierarchies - one for setting "cpuset.cpus.effective" and the other one for setting "cpuset.cpus.exclusive.effective". We may not need to set "cpuset.cpus" when we activate a partition root anymore. Also update Documentation/admin-guide/cgroup-v2.rst and cpuset.c comment to document this change. Suggested-by: Petr Malat <oss@malat.biz> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-19 07:37:38 -10:00
Waiman Long	fe8cd2736e	cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition The CS_CPU_EXCLUSIVE flag is currently set whenever cpuset.cpus.exclusive is set to make sure that the exclusivity test will be run to ensure its exclusiveness. At the same time, this flag can be changed whenever the partition root state is changed. For example, the CS_CPU_EXCLUSIVE flag will be reset whenever a partition root becomes invalid. This makes using CS_CPU_EXCLUSIVE to ensure exclusiveness a bit fragile. The current scheme also makes setting up a cpuset.cpus.exclusive hierarchy to enable remote partition harder as cpuset.cpus.exclusive cannot overlap with any cpuset.cpus of sibling cpusets if their cpuset.cpus.exclusive aren't set. Solve these issues by deferring the setting of CS_CPU_EXCLUSIVE flag until the cpuset become a valid partition root while adding new checks in validate_change() to ensure that cpuset.cpus.exclusive of sibling cpusets cannot overlap. An additional check is also added to validate_change() to make sure that cpuset.cpus of one cpuset cannot be a subset of cpuset.cpus.exclusive of a sibling cpuset to avoid the problem that none of those CPUs will be available when these exclusive CPUs are extracted out to a newly enabled partition root. The Documentation/admin-guide/cgroup-v2.rst file is updated to document the new constraints. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-19 07:37:37 -10:00
Waiman Long	ccac8e8de9	cgroup/cpuset: Fix remote root partition creation problem Since commit `181c8e091a` ("cgroup/cpuset: Introduce remote partition"), a remote partition can be created underneath a non-partition root cpuset as long as its exclusive_cpus are set to distribute exclusive CPUs down to its children. The generate_sched_domains() function, however, doesn't take into account this new behavior and hence will fail to create the sched domain needed for a remote root (non-isolated) partition. There are two issues related to remote partition support. First of all, generate_sched_domains() has a fast path that is activated if root_load_balance is true and top_cpuset.nr_subparts is non-zero. The later condition isn't quite correct for remote partitions as nr_subparts just shows the number of local child partitions underneath it. There can be no local child partition under top_cpuset even if there are remote partitions further down the hierarchy. Fix that by checking for subpartitions_cpus which contains exclusive CPUs allocated to both local and remote partitions. Secondly, the valid partition check for subtree skipping in the csa[] generation loop isn't enough as remote partition does not need to have a partition root parent. Fix this problem by breaking csa[] array generation loop of generate_sched_domains() into v1 and v2 specific parts and checking a cpuset's exclusive_cpus before skipping its subtree in the v2 case. Also simplify generate_sched_domains() for cgroup v2 as only non-isolating partition roots should be included in building the cpuset array and none of the v1 scheduling attributes other than a different way to create an isolated partition are supported. Fixes: `181c8e091a` ("cgroup/cpuset: Introduce remote partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-19 07:37:37 -10:00
Oleg Nesterov	6fe960147e	cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit() cgroup_exit() needs to do this only if the exiting task is a leader and it is not the last live thread. The patch doesn't use delay_group_leader(), atomic_read(signal->live) matches the code css_task_iter_advance() more. cgroup_release() can now check list_empty(task->cg_list) before it takes css_set_lock and calls ss_set_skip_task_iters(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-19 07:18:41 -10:00
Waiman Long	1805c1729f	cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls If only isolated partitions are being created underneath the cgroup root, there will only be one sched domain with top_cpuset.effective_cpus. We can skip the unnecessary sched domains scanning code and save some cycles. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-03 09:47:11 -10:00
Xiu Jianfeng	018ee567de	cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE In the cpuset_css_online(), clearing the CS_SCHED_LOAD_BALANCE bit of cs->flags is guarded by callback_lock and cpuset_mutex. There is no problem with itself, because it is consistent with the description of there two global lock at the beginning of this file. However, since the operation of checking, setting and clearing the flag bit is atomic, protection of callback_lock is unnecessary here, see CS_SPREAD_*. so to make it more consistent with the other code, move the operation outside the critical section of callback_lock. No functional changes intended. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-06-01 07:06:17 -10:00
David Wang	a8d55ff5f3	kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files Even though css_clear_dir would be called to cleanup all existing cgroup files when css_populate_dir failed, reclaiming newly created cgroup files before css_populate_dir returns with failure makes code more consistent. Signed-off-by: David Wang <00107082@163.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:48:11 -10:00
Michal Koutný	3f26a885a0	cgroup/pids: Add pids.events.local Hierarchical counting of events is not practical for watching when a particular pids.max is being hit. Therefore introduce .local flavor of events file (akin to memory controller) that collects only events relevant to given cgroup. The file is only added to the default hierarchy. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:45:10 -10:00
Michal Koutný	385a635cac	cgroup/pids: Make event counters hierarchical The pids.events file should honor the hierarchy, so make the events propagate from their origin up to the root on the unified hierarchy. The legacy behavior remains non-hierarchical. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:45:09 -10:00
Michal Koutný	73e75e6fc3	cgroup/pids: Separate semantics of pids.events related to pids.max Currently, when pids.max limit is breached in the hierarchy, the event is counted and reported in the cgroup where the forking task resides. This decouples the limit and the notification caused by the limit making it hard to detect when the actual limit was effected. Redefine the pids.events:max as: the number of times the limit of the cgroup was hit. (Implementation differentiates also "forkfail" event but this is currently not exposed as it would better fit into pids.stat. It also differs from pids.events:max only when pids.max is configured on non-leaf cgroups.) Since it changes semantics of the original "max" event, introduce this change only in the v2 API of the controller and add a cgroup2 mount option to revert to the legacy behavior. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:45:09 -10:00
Xiu Jianfeng	0ac380020c	cgroup/cpuset: Update comment on callback_lock Since commit `51ffe41178` ("cpuset: convert away from cftype->read()"), cpuset_common_file_read() has been renamed. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:30:38 -10:00
Xiu Jianfeng	d9fc6b4220	cgroup/cpuset: Remove unnecessary zeroing The struct cpuset is kzalloc'd, all the members are zeroed already, so don't need nodes_clear() here. No functional changes intended. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-26 08:28:30 -10:00
Linus Torvalds	8dde191aab	Misc fixes: - Fix a sched_balance_newidle setting bug - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst - Fix variable-shadowing build warning - Extend sched-domains debug output - Fix documentation - Fix comments Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZIbj4RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hEng/+NlAh7mm4AWckVjUxqyUnJ/omaV9Fe5F+ koiihntyvhk+4RR40XomXPq37Av3zPo1dnKI4fJ3yioMs1tB+8JD+nVo3DURLGT/ 4k+lYI+K6RXBzUTpzeYZWVfa+ddGwbRu1KA5joI7QvRfjil7QP5rC5AQbAj0AiVO Xvor0M9vEcfkqShTttx4h2u7WVR4zqVEhBxkWNMT6dMxN2HnKm4qcAiX39E8p+Vx maC2/iO+1rXORRbUh+KBHR40WAwe2CVvh5hCe1sl+/vGfCbAnMK1k+j85UdV1pFD aZ1jSBwIERnx9PdD5zK0GCRx9hmux8mkJCeBseZyK/XubYuVOLiwBxfYA/9C3i3O 1mQizaFBD8zanEiWj10sOxbfry+XhLwcISIiWC+xLpxKb0MvDD1TIeZR1fJv3Oz7 14iYhq2CuKhfntYmV6fYTzSzXL2s16dMYMH/7m7cLY0P/cJo2vw7GNxkwPeJsOVN uX6jnRde2Kp3q+Er3I2u1SGeAZ8fEzXr19MCWRA0qI+wvgYQkaTgoh9zO9AwRNoa 9hS/jc6Gq+O5xBMMJIPZMfOVai9RhYlPmQavFCGJLd3EFoVi9jp9+/iXgtyARCZp rfXFV9Dd9GvpFRzNnsMrLiKswBzUop5+epHYKZhVHJKH7aiHMbGEFD6cgNlf8k9b GFda3ay4JHA= =2okO -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix a sched_balance_newidle setting bug - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst - Fix variable-shadowing build warning - Extend sched-domains debug output - Fix documentation - Fix comments * tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write() sched/fair: Remove stale FREQUENCY_UTIL comment sched/fair: Fix initial util_avg calculation docs: cgroup-v1: Clarify that domain levels are system-specific sched/debug: Dump domains' level sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level arch/topology: Fix variable naming to avoid shadowing	2024-05-19 11:38:15 -07:00
Vitalii Bursov	a1fd0b9d75	sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation: -1 no request. use system default or follow request of others. 0 no search. 1 search siblings (hyperthreads in a core). "2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels. Fixes: `1d3504fcf5` ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com	2024-05-17 09:48:24 +02:00
Jesper Dangaard Brouer	21c38a3bd4	cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints This closely resembles helpers added for the global cgroup_rstat_lock in commit `fc29e04ae1` ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock. Based on production workloads, we observe the fast-path "update" function cgroup_rstat_updated() is invoked around 3 million times per sec, while the "flush" function cgroup_rstat_flush_locked(), walking each possible CPU, can see periodic spikes of 700 invocations/sec. For this reason, the tracepoints are split into normal and fastpath versions for this per-CPU lock. Making it feasible for production to continuously monitor the non-fastpath tracepoint to detect lock contention issues. The reason for monitoring is that lock disables IRQs which can disturb e.g. softirq processing on the local CPUs involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex), this per CPU lock becomes the next bottleneck that can introduce latency variations. A practical bpftrace script for monitoring contention latency: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_cpu_lock_contended { @start[tid]=nsecs; @cnt[probe]=count()} tracepoint:cgroup:cgroup_rstat_cpu_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);} @cnt[probe]=count()} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}' Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-05-14 09:43:17 -10:00
Xiu Jianfeng	b7d56d953a	cgroup/cpuset: Remove outdated comment in sched_partition_write() The comment here is outdated and can cause confusion, from the code perspective, there’s also no need for new comment, so just remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-04-25 07:16:19 -10:00

1 2 3 4 5 ...

728 Commits