korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-25 21:24:08 +08:00

Huang Ying c574bbe917 NUMA balancing: optimize page placement for memory tiering system	2022-03-22 15:57:09 -07:00

With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). The memory subsystem of such a machine can be called a memory tiering system, because the performance of the different types of memory usually differs.

In such a system, because memory access patterns change over time, some pages in the slow memory may become globally hot. So in this patch, the NUMA balancing mechanism is enhanced to dynamically optimize page placement among the different memory types according to whether pages are hot or cold.

In a typical memory tiering system, there are CPUs, fast memory and slow memory in each physical NUMA node. The CPUs and the fast memory are put in one logical node (called the fast memory node), while the slow memory is put in another (fake) logical node (called the slow memory node). That is, the fast memory is regarded as local while the slow memory is regarded as remote. So recently accessed pages in the slow memory node can be promoted to the fast memory node via the existing NUMA balancing mechanism.

The original NUMA balancing mechanism stops migrating pages if the free memory of the target node falls below the high watermark. This is a reasonable policy if there is only one memory type, but it makes the original NUMA balancing mechanism almost useless for optimizing page placement among different memory types. Details are as follows.

It is common for the working-set size of the workload to be larger than the size of the fast memory nodes; otherwise, it would be unnecessary to use the slow memory at all. So there are almost never enough free pages in the fast memory nodes, and the globally hot pages in the slow memory node cannot be promoted to the fast memory node. To solve the issue, we have 2 choices:

a. Ignore the free pages watermark check when promoting hot pages from the slow memory node to the fast memory node. This creates some memory pressure in the fast memory node and thus triggers memory reclaim, so that the cold pages in the fast memory node are demoted to the slow memory node.

b. Define a new watermark called wmark_promo, which is higher than wmark_high, and have kswapd reclaim pages until the free pages reach that watermark. The scenario is as follows: when we want to promote hot pages from slow memory to fast memory, but the promotion would push the fast memory's free pages below the high watermark, we wake up kswapd with the wmark_promo watermark in order to demote cold pages and free up some space. The next time we want to promote hot pages, we may have a chance of doing so.

Choice "a" may create high memory pressure in the fast memory node. If the memory pressure of the workload is already high, the pressure may become so high that the memory allocation latency of the workload is affected, e.g. direct reclaim may be triggered.

Choice "b" works much better in this respect. If the memory pressure of the workload is high, hot page promotion stops earlier, because its allocation watermark is higher than that of normal memory allocation. So in this patch, choice "b" is implemented. A new zone watermark (WMARK_PROMO) is added, which is larger than the high watermark and can be controlled via watermark_scale_factor.

In addition to the original page placement optimization among sockets, the NUMA balancing mechanism is extended to optimize page placement among different memory types according to whether pages are hot or cold. So the sysctl user space interface (numa_balancing) is extended in a backward compatible way as follows, so that users can enable or disable each of these functions individually.

The sysctl is converted from a Boolean value to a bit field. The definition of the flags is:

- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING

We have tested the patch with the pmbench memory accessing benchmark, using an 80:20 read/write ratio and the Gauss access address distribution, on a 2-socket Intel server with Optane DC Persistent Memory Model. The test results show that the pmbench score can improve by up to 95.9%.

Thanks to Andrew Morton for helping to fix the document format error.

Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
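
The choice "b" policy described in the commit message can be illustrated with a small user-space model in C. This is only a sketch under stated assumptions, not the kernel implementation: the three flag names and values come from the commit message, while `struct fast_node`, `try_promote()` and `kswapd_run()` are hypothetical names introduced here for illustration, and reclaim is modelled as instantly raising the free page count to the promo watermark.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as listed in the commit message (the sysctl is a bit field). */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/* Hypothetical, simplified stand-in for a fast memory node (not a kernel struct). */
struct fast_node {
    unsigned long free_pages;
    unsigned long wmark_high;
    unsigned long wmark_promo;  /* assumed to be higher than wmark_high */
    bool kswapd_woken;
};

/*
 * Try to promote nr_pages hot pages into the fast node.
 * Promotion proceeds only if free memory would stay at or above the high
 * watermark; otherwise background reclaim is requested instead.
 */
static bool try_promote(struct fast_node *node, unsigned int mode,
                        unsigned long nr_pages)
{
    assert(node->wmark_promo > node->wmark_high);

    /* Tiering promotion is active only when its bit is set. */
    if (!(mode & NUMA_BALANCING_MEMORY_TIERING))
        return false;

    if (node->free_pages >= node->wmark_high + nr_pages) {
        node->free_pages -= nr_pages;
        return true;
    }

    /* Not enough headroom: ask kswapd to free space up to wmark_promo. */
    node->kswapd_woken = true;
    return false;
}

/* Model of kswapd: demote cold pages until free pages reach wmark_promo. */
static void kswapd_run(struct fast_node *node)
{
    if (node->kswapd_woken && node->free_pages < node->wmark_promo)
        node->free_pages = node->wmark_promo;
    node->kswapd_woken = false;
}
```

In this model a failed promotion attempt is not retried immediately. It only wakes the reclaimer, so a later attempt may succeed once free pages have been raised to `wmark_promo`, which mirrors the "next time we want to promote" behaviour described above.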
..
damon	mm/damon: hide kernel pointer from tracepoint event	2022-01-15 16:30:33 +02:00
kasan	lib/stackdepot: always do filter_irq_stacks() in stack_depot_save()	2022-01-22 08:33:38 +02:00
kfence	kfence: make test case compatible with run time set sample interval	2022-02-11 17:55:00 -08:00
backing-dev.c	remove congestion tracking framework	2022-03-22 15:57:01 -07:00
balloon_compaction.c	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
bootmem_info.c	bootmem: Use page->index instead of page->freelist	2022-01-06 12:27:03 +01:00
cma_debug.c
cma_sysfs.c
cma.c	mm/cma: provide option to opt out from exposing pages on activation failure	2022-03-22 15:57:09 -07:00
cma.h	mm/cma: provide option to opt out from exposing pages on activation failure	2022-03-22 15:57:09 -07:00
compaction.c	mm: compaction: cleanup the compaction trace events	2022-03-22 15:57:09 -07:00
debug_page_ref.c
debug_vm_pgtable.c	mm/debug_vm_pgtable: remove pte entry from the page table	2022-02-04 09:25:04 -08:00
debug.c	mm,fs: split dump_mapping() out from dump_page()	2022-01-15 16:30:26 +02:00
dmapool.c	mm/dmapool.c: revert "make dma pool to use kmalloc_node"	2022-01-15 16:30:28 +02:00
early_ioremap.c	mm/early_ioremap.c: remove redundant early_ioremap_shutdown()	2021-09-08 11:50:24 -07:00
fadvise.c	remove inode_congested()	2022-03-22 15:57:01 -07:00
failslab.c
filemap.c	tmpfs: do not allocate pages on read	2022-03-22 15:57:02 -07:00
folio-compat.c	filemap: Add filemap_release_folio()	2022-01-04 13:15:34 -05:00
frontswap.c	frontswap: remove support for multiple ops	2022-01-22 08:33:38 +02:00
gup_test.c	selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages	2021-05-05 11:27:26 -07:00
gup_test.h	selftests/vm: gup_test: fix test flag	2021-05-05 11:27:26 -07:00
gup.c	mm/gup: remove unused get_user_pages_locked()	2022-03-22 15:57:01 -07:00
highmem.c	Fixes for 5.16 folios:	2021-11-25 10:13:56 -08:00
hmm.c	mm/hmm.c: allow VM_MIXEDMAP to work with hmm_range_fault	2022-01-15 16:30:31 +02:00
huge_memory.c	mm/thp: refix __split_huge_pmd_locked() for migration PMD	2022-03-22 15:57:09 -07:00
hugetlb_cgroup.c	hugetlb: add hugetlb.*.numa_stat file	2022-01-15 16:30:29 +02:00
hugetlb_vmemmap.c	mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key	2022-03-22 15:57:08 -07:00
hugetlb_vmemmap.h	mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate	2021-06-30 20:47:25 -07:00
hugetlb.c	userfaultfd: provide unmasked address on page-fault	2022-03-22 15:57:08 -07:00
hwpoison-inject.c	mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler	2022-03-22 15:57:07 -07:00
init-mm.c	mm: add setup_initial_init_mm() helper	2021-07-08 11:48:21 -07:00
internal.h	mm/sparse: make mminit_validate_memmodel_limits() static	2022-03-22 15:57:05 -07:00
interval_tree.c
io-mapping.c
ioremap.c	mm: move ioremap_page_range to vmalloc.c	2021-09-08 11:50:24 -07:00
Kconfig	mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB	2022-03-22 15:57:08 -07:00
Kconfig.debug	mm: page table check	2022-01-15 16:30:28 +02:00
khugepaged.c	mm/page_table_check: check entries at pmd levels	2022-02-04 09:25:04 -08:00
kmemleak.c	mm/kmemleak: avoid scanning potential huge holes	2022-02-04 09:25:05 -08:00
ksm.c	mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy	2022-01-15 16:30:31 +02:00
list_lru.c	mm/list_lru: optimize memcg_reparent_list_lru_node()	2022-03-22 15:57:08 -07:00
maccess.c	ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault	2021-08-20 11:39:25 +01:00
madvise.c	mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler	2022-03-22 15:57:07 -07:00
Makefile	mm: remove cleancache	2022-01-22 08:33:38 +02:00
mapping_dirty_helpers.c	mm: move tlb_flush_pending inline helpers to mm_inline.h	2022-01-15 16:30:27 +02:00
memblock.c	memblock: use kfree() to release kmalloced memblock regions	2022-02-20 08:45:39 +02:00
memcontrol.c	mm: memcontrol: fix cannot alloc the maximum memcg ID	2022-03-22 15:57:03 -07:00
memfd.c	memfd: fix F_SEAL_WRITE after shmem huge page allocated	2022-03-05 11:08:32 -08:00
memory_hotplug.c	mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key	2022-03-22 15:57:08 -07:00
memory-failure.c	mm/memory-failure.c: make non-LRU movable pages unhandlable	2022-03-22 15:57:07 -07:00
memory.c	userfaultfd: provide unmasked address on page-fault	2022-03-22 15:57:08 -07:00
mempolicy.c	mempolicy: mbind_range() set_policy() after vma_merge()	2022-03-22 15:57:09 -07:00
mempool.c	mm: remove spurious blkdev.h includes	2021-10-18 06:17:01 -06:00
memremap.c	mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory	2022-03-22 15:57:01 -07:00
memtest.c
migrate.c	NUMA balancing: optimize page placement for memory tiering system	2022-03-22 15:57:09 -07:00
mincore.c
mlock.c	mm/mlock: fix potential imbalanced rlimit ucounts adjustment	2022-03-22 15:57:07 -07:00
mm_init.c
mmap_lock.c	mm: mmap_lock: fix disabling preemption directly	2021-07-23 17:43:28 -07:00
mmap.c	mm/mmap: remove obsolete comment in ksys_mmap_pgoff	2022-03-22 15:57:05 -07:00
mmu_gather.c	mm: move tlb_flush_pending inline helpers to mm_inline.h	2022-01-15 16:30:27 +02:00
mmu_notifier.c
mmzone.c	mm/mmzone.c: use try_cmpxchg() in page_cpupid_xchg_last()	2022-03-22 15:57:05 -07:00
mprotect.c	mm: refactor vm_area_struct::anon_vma_name usage code	2022-03-05 11:08:32 -08:00
mremap.c	mm/mremap:: use vma_lookup() instead of find_vma()	2022-03-22 15:57:05 -07:00
msync.c
nommu.c	Merge branch 'akpm' (patches from Andrew)	2021-11-06 14:08:17 -07:00
oom_kill.c	mm/oom_kill: remove unneeded is_memcg_oom check	2022-03-22 15:57:09 -07:00
page_alloc.c	NUMA balancing: optimize page placement for memory tiering system	2022-03-22 15:57:09 -07:00
page_counter.c	mm/page_counter: remove an incorrect call to propagate_protected_usage()	2022-01-15 16:30:27 +02:00
page_ext.c	mm: make some vars and functions static or __init	2022-01-15 16:30:31 +02:00
page_idle.c	mm/idle_page_tracking: make PG_idle reusable	2021-09-08 11:50:24 -07:00
page_io.c	delayacct: support swapin delay accounting for swapping without blkio	2022-01-20 08:52:55 +02:00
page_isolation.c	Revert "mm/page_isolation: unset migratetype directly for non Buddy page"	2022-02-04 09:25:04 -08:00
page_owner.c	lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()	2022-01-22 08:33:37 +02:00
page_poison.c
page_reporting.c	mm/page_reporting: allow driver to specify reporting order	2021-06-29 10:53:47 -07:00
page_reporting.h	mm/page_reporting: export reporting order as module parameter	2021-06-29 10:53:47 -07:00
page_table_check.c	mm/page_table_check: check entries at pmd levels	2022-02-04 09:25:04 -08:00
page_vma_mapped.c	mm: device exclusive memory access	2021-07-01 11:06:03 -07:00
page-writeback.c	mm/writeback: minor clean up for highmem_dirtyable_memory	2022-03-22 15:57:01 -07:00
pagewalk.c	mm: pagewalk: fix walk for hugepage tables	2021-06-29 10:53:49 -07:00
percpu-internal.h	mm: memcg/percpu: account extra objcg space to memory cgroups	2022-01-15 16:30:31 +02:00
percpu-km.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu-stats.c	percpu: rework memcg accounting	2021-06-05 20:43:15 +00:00
percpu-vm.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu.c	bitmap patches for 5.17-rc1	2022-01-23 06:20:44 +02:00
pgalloc-track.h	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
pgtable-generic.c	mm: move tlb_flush_pending inline helpers to mm_inline.h	2022-01-15 16:30:27 +02:00
process_vm_access.c	mm/process_vm_access.c: remove duplicate include	2021-05-05 11:27:27 -07:00
ptdump.c	mm: sparsemem: use page table lock to protect kernel pmd operations	2022-03-22 15:57:08 -07:00
readahead.c	remove inode_congested()	2022-03-22 15:57:01 -07:00
rmap.c	mm/rmap: fix potential batched TLB flush race	2022-01-15 16:30:31 +02:00
rodata_test.c
secretmem.c	mm/secretmem: avoid letting secretmem_users drop to zero	2021-10-28 17:18:55 -07:00
shmem.c	mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()	2022-03-22 15:57:04 -07:00
shuffle.c
shuffle.h	mm/shuffle: fix section mismatch warning	2021-05-22 15:09:07 -10:00
slab_common.c	Merge branch 'akpm' (patches from Andrew)	2022-01-15 20:37:06 +02:00
slab.c	mm: introduce kmem_cache_alloc_lru	2022-03-22 15:57:03 -07:00
slab.h	mm: introduce kmem_cache_alloc_lru	2022-03-22 15:57:03 -07:00
slob.c	mm: introduce kmem_cache_alloc_lru	2022-03-22 15:57:03 -07:00
slub.c	mm: introduce kmem_cache_alloc_lru	2022-03-22 15:57:03 -07:00
sparse-vmemmap.c	mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP	2022-03-22 15:57:08 -07:00
sparse.c	mm/sparse: make mminit_validate_memmodel_limits() static	2022-03-22 15:57:05 -07:00
swap_cgroup.c
swap_slots.c	treewide: Add missing includes masked by cgroup -> bpf dependency	2021-12-03 10:58:13 -08:00
swap_state.c	mm: swap: get rid of livelock in swapin readahead	2022-03-17 11:02:13 -07:00
swap.c	mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu	2022-03-22 15:57:08 -07:00
swapfile.c	userfaultfd: provide unmasked address on page-fault	2022-03-22 15:57:08 -07:00
truncate.c	mm: remove cleancache	2022-01-22 08:33:38 +02:00
usercopy.c	mm: Convert check_heap_object() to use struct slab	2022-01-06 12:25:51 +01:00
userfaultfd.c	mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()	2022-03-22 15:57:04 -07:00
util.c	mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls	2022-03-04 10:00:37 -08:00
vmacache.c
vmalloc.c	mm/vmalloc.c: fix "unused function" warning	2022-03-22 15:57:05 -07:00
vmpressure.c	mm/vmpressure: fix data-race with memcg->socket_pressure	2021-11-06 13:30:40 -07:00
vmscan.c	NUMA balancing: optimize page placement for memory tiering system	2022-03-22 15:57:09 -07:00
vmstat.c	NUMA Balancing: add page promotion counter	2022-03-22 15:57:09 -07:00
workingset.c	mm: workingset: replace IRQ-off check with a lockdep assert.	2022-03-22 15:57:08 -07:00
z3fold.c	mm/z3fold: add kerneldoc fields for z3fold_pool	2021-07-01 11:06:03 -07:00
zbud.c	mm/zbud: add kerneldoc fields for zbud_pool	2021-07-01 11:06:03 -07:00
zpool.c	zpool: remove the list of pools_head	2022-01-15 16:30:31 +02:00
zsmalloc.c	zsmalloc: replace get_cpu_var with local_lock	2022-01-22 08:33:37 +02:00
zswap.c	frontswap: remove support for multiple ops	2022-01-22 08:33:38 +02:00