linux/mm
Yosry Ahmed 8d59d2214c mm: memcg: make stats flushing threshold per-memcg
A global counter for the magnitude of memcg stats updates is maintained on
the memcg side to avoid invoking rstat flushes when the pending updates
are not significant.  This avoids unnecessary flushes, which are not very
cheap even when there aren't many stats to flush.  It also avoids
unnecessary contention on the underlying global rstat lock.

Make this threshold per-memcg.  The same scheme is followed: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold (see
the sketch below).
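
For illustration, a minimal kernel-style C sketch of the update path
described above.  The names (stats_updated(), UPDATE_BATCH,
struct memcg_sketch) are hypothetical stand-ins, not the exact
memcontrol.c symbols:

  #include <linux/atomic.h>
  #include <linux/percpu.h>

  #define UPDATE_BATCH 64	/* stand-in for MEMCG_CHARGE_BATCH */

  struct memcg_sketch {
          struct memcg_sketch *parent;
          int __percpu *pending;          /* per-cpu, per-memcg counter */
          atomic64_t stats_updates;       /* per-memcg update magnitude */
  };

  static void stats_updated(struct memcg_sketch *memcg, int abs_delta)
  {
          struct memcg_sketch *pos;
          int pending;

          /* Common case: a single percpu add, no shared atomics touched. */
          pending = this_cpu_add_return(*memcg->pending, abs_delta);
          if (pending < UPDATE_BATCH)
                  return;

          this_cpu_sub(*memcg->pending, pending);
          /*
           * Propagate to every ancestor so that a subtree flush can be
           * decided (and its counter reset) locally at any tree level.
           */
          for (pos = memcg; pos; pos = pos->parent)
                  atomic64_add(pending, &pos->stats_updates);
  }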

This provides two benefits: (a) On large machines with a lot of memcgs,
the global threshold can be reached relatively fast, so guarding the
underlying lock becomes less effective.  Making the threshold per-memcg
avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush.  Per-memcg
counters remove this blocker to subtree flushes, which helps avoid
unnecessary work when only the stats of a small subtree are needed (see
the flush-side sketch below).
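
Continuing the hypothetical sketch above, the flush side can then test
and reset the pending magnitude for just the subtree being flushed
(flush_subtree_stats() is an assumed stand-in for the actual rstat
flush):

  #include <linux/cpumask.h>

  static void flush_subtree_stats(struct memcg_sketch *memcg);

  static void maybe_flush_subtree(struct memcg_sketch *memcg)
  {
          /* Skip the flush if this subtree has few pending updates. */
          if (atomic64_read(&memcg->stats_updates) <=
              UPDATE_BATCH * num_online_cpus())
                  return;

          /* Resetting a per-memcg counter needs no full flush. */
          atomic64_set(&memcg->stats_updates, 0);
          flush_subtree_stats(memcg);
  }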

Nothing is free, of course.  This comes at a cost: (a) A new per-cpu
counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes.  The extra
memory usage is insignificant.

(b) More work on the update side, although in the common case it will only
be percpu counter updates.  The amount of work scales with the number of
ancestors (i.e.  tree depth).  This is not a new concept; adding a cgroup
to the rstat tree involves a parent loop, and so does charging.  Testing
results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases from
NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. 
This is probably fine because we have a similar per-memcg error in charges
coming from percpu stocks, and we have a periodic flusher that makes sure
we always flush all the stats every 2s anyway.
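
As an illustrative calculation (the numbers are assumed, not from the
patch): with 256 CPUs and MEMCG_CHARGE_BATCH = 64, the old system-wide
error bound is 256 * 64 = 16384 pending events; with 1000 memcgs the new
worst-case bound becomes 256 * 64 * 1000 ~= 16.4M events, while the 2s
periodic flush bounds how long the stats can remain that stale.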

This patch was tested to make sure no significant regressions are
introduced on the update path.  The following benchmarks were run in a
cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances, as well as netserver, are run
in the level-2 cgroup:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+------------+
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

All regressions seem to be minimal and within the normal variance for the
benchmark.  The fix for [1] assumed that 3% is noise (and there were no
further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20 14:48:11 -08:00
damon sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
kasan sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
kfence LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
kmsan kmsan: use stack_depot_save instead of __stack_depot_save 2023-12-10 16:51:46 -08:00
backing-dev.c writeback: remove redundant checks for root memcg 2023-08-21 13:37:48 -07:00
balloon_compaction.c
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
cma.c mm: cma: remove unnecessary initialization of ret 2023-12-12 10:57:08 -08:00
cma.h
compaction.c mm: compaction: avoid fast_isolate_freepages blindly choose improper pageblock 2023-12-12 10:57:08 -08:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: fix multiple typos in multiple files 2023-10-25 16:47:14 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
folio-compat.c mm: return void from folio_start_writeback() and related functions 2023-12-10 16:51:37 -08:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
gup.c mm/gup: fix follow_devmap_p[mu]d() on page==NULL handling 2023-12-10 16:51:52 -08:00
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
huge_memory.c mm: huge_memory: use more folio api in __split_huge_page_tail() 2023-12-12 10:57:07 -08:00
hugetlb_cgroup.c mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER 2023-10-18 14:34:17 -07:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range() 2023-12-12 10:57:08 -08:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write 2023-12-06 16:12:43 -08:00
hwpoison-inject.c mm/hwpoison: add __init/__exit annotations to module init/exit funcs 2022-10-03 14:03:05 -07:00
init-mm.c mm: move dummy_vm_ops out of a header 2023-08-21 13:37:46 -07:00
internal.h mm: use vma_pages() for vma objects 2023-12-12 10:57:08 -08:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
Kconfig mm/thp: add CONFIG_TRANSPARENT_HUGEPAGE_NEVER option 2023-12-12 10:57:07 -08:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
khugepaged.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
kmemleak.c kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers 2023-12-12 10:57:07 -08:00
ksm.c mm: ksm: use more folio api in ksm_might_need_to_copy() 2023-12-12 10:57:05 -08:00
list_lru.c mm/list_lru.c: remove unused list_lru_from_kmem() 2023-12-20 14:48:11 -08:00
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range() 2023-12-06 16:12:50 -08:00
Makefile mm: vmscan: move shrinker-related code into a separate file 2023-10-04 10:32:23 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c NUMA: optimize detection of memory with no node id assigned by firmware 2023-12-10 16:51:34 -08:00
memcontrol.c mm: memcg: make stats flushing threshold per-memcg 2023-12-20 14:48:11 -08:00
memfd.c memfd: drop warning for missing exec-related flags 2023-10-04 10:32:22 -07:00
memory_hotplug.c mm/memory_hotplug: split memmap_on_memory requests across memblocks 2023-12-10 16:51:34 -08:00
memory-failure.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
memory-tiers.c dax, kmem: calculate abstract distance with general interface 2023-10-16 15:44:39 -07:00
memory.c mm: memory: use folio_prealloc() in wp_page_copy() 2023-12-12 10:57:06 -08:00
mempolicy.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
mempool.c mm/mempool: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:49 -08:00
memremap.c mm: use vmem_altmap code without CONFIG_ZONE_DEVICE 2023-12-10 16:51:48 -08:00
memtest.c mm: memtest: convert to memtest_report_meminfo() 2023-08-21 13:37:47 -07:00
migrate_device.c Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
migrate.c mm: migrate high-order folios in swap cache correctly 2023-12-20 13:46:19 -08:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
mlock.c mm: mlock: avoid folio_within_range() on KSM pages 2023-10-25 16:47:14 -07:00
mm_init.c mm/mm_init.c: append newline to the unavailable ranges log-message 2023-12-10 16:51:51 -08:00
mm_slot.h
mmap_lock.c
mmap.c mmap: remove the IA64-specific vma expansion implementation 2023-12-10 16:51:39 -08:00
mmu_gather.c mm: fix kernel-doc warning from tlb_flush_rmaps() 2023-08-24 16:20:30 -07:00
mmu_notifier.c mmu_notifiers: rename invalidate_range notifier 2023-08-18 10:12:41 -07:00
mmzone.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
mprotect.c mm: mprotect: use a folio in change_pte_range() 2023-10-25 16:47:12 -07:00
mremap.c mm: abstract VMA merge and extend into vma_merge_extend() helper 2023-10-18 14:34:18 -07:00
msync.c
nommu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
oom_kill.c mm, oom:dump_tasks add rss detailed information printing 2023-12-10 16:51:53 -08:00
page_alloc.c mm: page_alloc: unreserve highatomic page blocks before oom 2023-12-10 16:51:52 -08:00
page_counter.c
page_ext.c mm/page_ext: move functions around for minor cleanups to page_ext 2023-08-18 10:12:31 -07:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c mm: memcg: add THP swap out info for anonymous reclaim 2023-10-04 10:32:27 -07:00
page_isolation.c mm/hugetlb: get rid of page_hstate() 2023-08-18 10:12:39 -07:00
page_owner.c mm/page_owner: record and dump free_pid and free_tgid 2023-12-10 16:51:40 -08:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm: convert page_table_check_pte_set() to page_table_check_ptes_set() 2023-08-24 16:20:18 -07:00
page_vma_mapped.c mm: correct stale comment of function check_pte 2023-08-18 10:12:13 -07:00
page-writeback.c mm: return void from folio_start_writeback() and related functions 2023-12-10 16:51:37 -08:00
pagewalk.c mm: pagewalk: assert write mmap lock only for walking the user page tables 2023-12-10 16:51:53 -08:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
pgalloc-track.h
pgtable-generic.c mm/pgtable: notes on pte_offset_map[_lock]() 2023-08-18 10:12:25 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c mm/readahead: do not allow order-1 folio 2023-12-12 10:57:06 -08:00
rmap.c mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap() 2023-10-18 14:34:14 -07:00
rodata_test.c mm/rodata_test: use PAGE_ALIGNED() helper 2022-10-03 14:03:05 -07:00
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem_quota.c shmem: Add default quota limit mount options 2023-08-09 09:15:40 +02:00
shmem.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
show_mem.c mm: refactor si_mem_available() 2023-10-04 10:32:19 -07:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shuffle.c mm/shuffle: convert module_param_call to module_param_cb 2022-10-03 14:03:07 -07:00
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab_common.c RCU pull request for v6.7 2023-10-30 18:01:41 -10:00
slab.c Randomized slab caches for kmalloc() 2023-07-18 10:07:47 +02:00
slab.h mm: kmem: scoped objcg protection 2023-10-25 16:47:11 -07:00
slub.c slub, kasan: improve interaction of KASAN and slub_debug poisoning 2023-12-10 16:51:48 -08:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm/sparse: remove redundant judgments from macro for_each_present_section_nr 2023-08-18 10:12:14 -07:00
swap_cgroup.c mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled 2022-10-03 14:03:36 -07:00
swap_slots.c
swap_state.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h zswap: make shrinking memcg-aware 2023-12-12 10:57:01 -08:00
swapfile.c mm/swapfile: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:53 -08:00
truncate.c fs: convert error_remove_page to error_remove_folio 2023-12-10 16:51:42 -08:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: more ptep_get() conversion 2023-11-15 15:30:09 -08:00
util.c mm/util: use kmap_local_page() in memcmp_pages() 2023-12-10 16:51:49 -08:00
vmalloc.c mm/vmalloc: fix the unchecked dereference warning in vread_iter() 2023-11-01 12:38:35 -07:00
vmpressure.c net-memcg: Fix scope of sockmem pressure indicators 2023-08-16 12:21:32 +01:00
vmscan.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
vmstat.c mm: memcg: add per-memcg zswap writeback stat 2023-12-12 10:57:02 -08:00
workingset.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
z3fold.c mm/z3fold: remove obsolete comment for struct z3fold_pool 2023-08-21 13:37:51 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c zsmalloc: use copy_page for full page copy 2023-10-18 14:34:16 -07:00
zswap.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00