linux/kernel
David Finkel c6f53ed8f2 mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark.  Restore parity
with those mechanisms, but with a less racy API.

For example:
 - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
   the high watermark.
 - writing "5" to the clear_refs pseudo-file in a processes's proc
   directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.

Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.

Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path.  Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.

This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item.  Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place.  To facilitate this, we track
the peak memory usage.  However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing.  Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.

Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel <davidf@vimeo.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:53 -07:00
..
bpf bpf: Fix a kernel verifier crash in stacksafe() 2024-08-12 18:09:48 -07:00
cgroup mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
configs mm/slab: Plumb kmem_buckets into __do_kmalloc_node() 2024-07-03 12:24:19 +02:00
debug kdb: Get rid of redundant kdb_curr_task() 2024-06-21 15:49:29 +01:00
dma dma-debug: avoid deadlock between dma debug vs printk and netconsole 2024-08-06 10:29:32 -07:00
entry entry: Respect changes to system call number by trace_sys_enter() 2024-03-12 13:23:32 +01:00
events perf/bpf: Don't call bpf_overflow_handler() for tracing events 2024-08-13 10:25:28 -07:00
futex printk: Change type of CONFIG_BASE_SMALL to bool 2024-05-06 17:39:09 +02:00
gcov gcov: add support for GCC 14 2024-06-15 10:43:06 -07:00
irq genirq/irqdesc: Honor caller provided affinity in alloc_desc() 2024-08-07 17:27:00 +02:00
kcsan kcsan: Add missing MODULE_DESCRIPTION() macro 2024-06-06 11:21:14 -07:00
livepatch livepatch: Replace snprintf() with sysfs_emit() 2024-07-02 16:56:18 +02:00
locking bcachefs fixes for 6.11-rc3 2024-08-08 13:27:31 -07:00
module module: make waiting for a concurrent module loader interruptible 2024-08-09 08:33:28 -07:00
power mm: remove the implementation of swap_free() and always use swap_free_nr() 2024-07-03 19:30:01 -07:00
printk printk/panic: Allow cpu backtraces to be written into ringbuffer during panic 2024-08-13 14:16:22 +02:00
rcu Merge branches 'doc.2024.06.06a', 'fixes.2024.07.04a', 'mb.2024.06.28a', 'nocb.2024.06.03a', 'rcu-tasks.2024.06.06a', 'rcutorture.2024.06.06a' and 'srcu.2024.06.18a' into HEAD 2024-07-04 13:54:17 -07:00
sched task_stack: uninline stack_not_used 2024-09-01 20:25:49 -07:00
time timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex() 2024-08-05 16:14:14 +02:00
trace tracing: Return from tracing_buffers_read() if the file has been closed 2024-08-09 12:59:35 -04:00
.gitignore
acct.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
async.c async: Use a dedicated unbound workqueue with raised min_active 2024-02-09 11:13:59 -10:00
audit_fsnotify.c
audit_tree.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
audit_watch.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
audit.c audit: use KMEM_CACHE() instead of kmem_cache_create() 2024-01-25 10:12:22 -05:00
audit.h
auditfilter.c ima: Avoid blocking in RCU read-side critical section 2024-06-13 14:26:50 -04:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c backtracetest: add MODULE_DESCRIPTION() 2024-06-24 22:24:55 -07:00
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-04-29 08:29:29 -07:00
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c
compat.c
configs.c
context_tracking.c context_tracking: Make context_tracking_key __ro_after_init 2024-03-22 11:18:18 +01:00
cpu_pm.c
cpu.c cpu/SMT: Enable SMT only if a core is online 2024-08-13 10:31:24 +10:00
crash_core.c Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
crash_reserve.c crash: fix riscv64 crash memory reserve dead loop 2024-08-15 22:16:16 -07:00
cred.c cred: Use KMEM_CACHE() instead of kmem_cache_create() 2024-02-23 17:33:31 -05:00
delayacct.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
dma.c
elfcorehdr.c crash: remove dependency of FA_DUMP on CRASH_DUMP 2024-02-23 17:48:22 -08:00
exec_domain.c
exit.c task_stack: uninline stack_not_used 2024-09-01 20:25:49 -07:00
exit.h exit: add internal include file with helpers 2023-09-21 12:03:50 -06:00
extable.c
fail_function.c
fork.c mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options 2024-09-01 20:25:51 -07:00
freezer.c Linux 6.7-rc6 2023-12-23 15:52:13 +01:00
gen_kheaders.sh kheaders: use command -v to test for existence of cpio 2024-05-30 01:13:20 +09:00
groups.c groups: Convert group_info.usage to refcount_t 2023-09-29 11:28:39 -07:00
hung_task.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c
jump_label.c jump_label: Fix the fix, brown paper bags galore 2024-07-31 12:57:39 +02:00
kallsyms_internal.h kallsyms: get rid of code for absolute kallsyms 2024-07-20 16:33:21 +09:00
kallsyms_selftest.c kallsyms: Match symbols exactly with CONFIG_LTO_CLANG 2024-08-15 09:33:35 -07:00
kallsyms_selftest.h
kallsyms.c kallsyms: Match symbols exactly with CONFIG_LTO_CLANG 2024-08-15 09:33:35 -07:00
kcmp.c file: convert to SLAB_TYPESAFE_BY_RCU 2023-10-19 11:02:48 +02:00
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash: clean up kdump related config items 2024-02-23 17:48:22 -08:00
Kconfig.locks
Kconfig.preempt
kcov.c kcov: properly check for softirq context 2024-08-07 18:33:56 -07:00
kexec_core.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
kexec_elf.c
kexec_file.c crash: add a new kexec flag for hotplug support 2024-04-23 14:59:01 +10:00
kexec_internal.h crash: remove dependency of FA_DUMP on CRASH_DUMP 2024-02-23 17:48:22 -08:00
kexec.c crash: add a new kexec flag for hotplug support 2024-04-23 14:59:01 +10:00
kheaders.c
kprobes.c kprobes: Fix to check symbol prefixes correctly 2024-08-05 14:04:03 +09:00
ksyms_common.c
ksysfs.c profiling: remove prof_cpu_mask 2024-07-29 10:45:54 -07:00
kthread.c kunit: Handle test faults 2024-05-06 14:22:02 -06:00
latencytop.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
Makefile crash: split crash dumping code out from kexec_core.c 2024-02-23 17:48:22 -08:00
module_signature.c
notifier.c
nsproxy.c pidfd: add pidfs 2024-03-01 12:23:37 +01:00
numa.c kernel/numa.c: Move logging out of numa.h 2023-12-20 19:26:30 -05:00
padata.c padata: Fix possible divide-by-0 panic in padata_mt_helper() 2024-08-07 18:33:56 -07:00
panic.c printk/panic: Allow cpu backtraces to be written into ringbuffer during panic 2024-08-13 14:16:22 +02:00
params.c params: Fix multi-line comment style 2023-12-01 09:51:44 -08:00
pid_namespace.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
pid_sysctl.h sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
pid.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
profile.c profiling: remove profile=sleep support 2024-08-04 13:36:28 -07:00
ptrace.c ptrace_attach: shift send(SIGSTOP) into ptrace_set_stopped() 2024-02-22 15:38:52 -08:00
range.c
reboot.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
regset.c regset: use kvzalloc() for regset_get_alloc() 2024-04-25 21:07:03 -07:00
relay.c kernel: relay: remove relay_file_splice_read dead code, doesn't work 2023-12-29 12:22:27 -08:00
resource_kunit.c resource: add missing MODULE_DESCRIPTION() 2024-06-28 19:36:30 -07:00
resource.c mm: kvmalloc: align kvrealloc() with krealloc() 2024-09-01 20:25:44 -07:00
rseq.c
scftorture.c scftorture: Make torture_type static 2024-05-30 15:31:51 -07:00
scs.c
seccomp.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
signal.c kernel: rerun task_work while freezing in get_signal() 2024-07-11 01:51:44 -06:00
smp.c smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu() 2024-07-10 22:40:39 +02:00
smpboot.c kthread: add kthread_stop_put 2023-10-04 10:41:57 -07:00
smpboot.h
softirq.c softirq: Fix suspicious RCU usage in __do_softirq() 2024-04-29 05:03:51 +02:00
stackleak.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
stacktrace.c stacktrace: fix kernel-doc typo 2023-12-29 12:22:29 -08:00
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c Probes updates for v6.11: 2024-07-18 12:19:20 -07:00
sys.c RISC-V Patches for the 6.10 Merge Window, Part 1 2024-05-22 09:56:00 -07:00
sysctl-test.c sysctl: Add module description to sysctl-testing 2024-06-03 15:20:37 +02:00
sysctl.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
task_work.c task_work: make TWA_NMI_CURRENT handling conditional on IRQ_WORK 2024-07-29 12:05:06 -07:00
taskstats.c taskstats: fill_stats_for_tgid: use for_each_thread() 2023-10-04 10:41:57 -07:00
torture.c torture: Add MODULE_DESCRIPTION() 2024-05-30 15:31:38 -07:00
tracepoint.c
tsacct.c tsacct: replace strncpy() with strscpy() 2024-07-12 16:39:53 -07:00
ucount.c sysctl changes for v6.10-rc1 2024-05-17 17:31:24 -07:00
uid16.c
uid16.h
umh.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
up.c smp: Change function signatures to use call_single_data_t 2023-09-13 14:59:24 +02:00
user_namespace.c user_namespace: remove unnecessary NULL values from kbuf 2024-02-22 15:38:52 -08:00
user-return-notifier.c
user.c printk: Change type of CONFIG_BASE_SMALL to bool 2024-05-06 17:39:09 +02:00
usermode_driver.c
utsname_sysctl.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
utsname.c
vhost_task.c vhost_task: Handle SIGKILL by flushing work and exiting 2024-05-22 08:31:15 -04:00
vmcore_info.c kallsyms: get rid of code for absolute kallsyms 2024-07-20 16:33:21 +09:00
watch_queue.c watch_queue: fix kcalloc() arguments order 2023-12-21 13:17:54 +01:00
watchdog_buddy.c
watchdog_perf.c watchdog/perf: properly initialize the turbo mode timestamp and rearm counter 2024-07-17 21:11:34 -07:00
watchdog.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c workqueue: Correct declaration of cpu_pwq in struct workqueue_struct 2024-08-05 18:34:02 -10:00