linux/mm
Huang Ying 33024536ba memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages.  Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g.  60 seconds).  So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows,

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages.  Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g.  60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm. 
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, hint page fault will occur.  The scan
  time is gotten from the struct page.  And The hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher.  So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()).  Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node.  So the accessing CPU and PID information are unnecessary for the
slow memory pages.  We can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency < hot
threshold will be considered hot.

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
..
damon mm/damon/core: simplify the parameter passing for region split operation 2022-09-11 20:25:51 -07:00
kasan - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
kfence kfence: add sysfs interface to disable kfence for selected slabs. 2022-09-11 20:25:52 -07:00
backing-dev.c writeback: avoid use-after-free after removing device 2022-08-28 14:02:43 -07:00
balloon_compaction.c mm: Convert all PageMovable users to movable_operations 2022-08-02 12:34:03 -04:00
bootmem_info.c bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem 2022-08-28 14:02:45 -07:00
cma_debug.c mm/cma_debug: show complete cma name in debugfs directories 2022-09-11 20:25:50 -07:00
cma_sysfs.c
cma.c Revert "mm/cma.c: remove redundant cma_mutex lock" 2022-05-13 15:11:26 -07:00
cma.h mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
compaction.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
debug_page_ref.c
debug_vm_pgtable.c docs: rename Documentation/vm to Documentation/mm 2022-06-27 12:52:53 -07:00
debug.c mm: unexport page_init_poison 2022-03-24 19:06:45 -07:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c mm/early_ioremap: declare early_memremap_pgprot_adjust() 2022-03-22 15:57:11 -07:00
fadvise.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
failslab.c mm: fix missing handler for __GFP_NOWARN 2022-05-19 14:08:55 -07:00
filemap.c mm/filemap.c: convert page_endio() to use a folio 2022-09-11 20:25:48 -07:00
folio-compat.c mm/folio-compat: Remove migration compatibility functions 2022-08-02 12:34:04 -04:00
frontswap.c docs: rename Documentation/vm to Documentation/mm 2022-06-27 12:52:53 -07:00
gup_test.c mm: rename is_pinnable_page() to is_longterm_pinnable_page() 2022-07-17 17:14:27 -07:00
gup_test.h
gup.c mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes 2022-09-11 20:25:53 -07:00
highmem.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
hmm.c mm/hmm: fault non-owner device private entries 2022-07-29 11:33:37 -07:00
huge_memory.c memory tiering: hot page selection with hint page fault latency 2022-09-11 20:25:54 -07:00
hugetlb_cgroup.c hugetlb_cgroup: use helper for_each_hstate and hstate_index 2022-09-11 20:25:53 -07:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE 2022-08-08 18:06:43 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability 2022-08-08 18:06:43 -07:00
hugetlb.c mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-09-11 20:25:50 -07:00
hwpoison-inject.c mm/memory-failure: disable unpoison once hw error happens 2022-06-16 19:11:32 -07:00
init-mm.c kernel/fork: Initialize mm's PASID 2022-02-14 19:51:47 +01:00
internal.h mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: Add ioremap/iounmap_allowed() 2022-06-27 12:22:31 +01:00
Kconfig cxl for 6.0 2022-08-10 11:07:26 -07:00
Kconfig.debug Two followon fixes for the post-5.19 series "Use pageblock_order for cma 2022-05-27 11:40:49 -07:00
khugepaged.c mm/khugepaged: rename prefix of shared collapse functions 2022-09-11 20:25:46 -07:00
kmemleak.c mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan() 2022-06-16 19:48:32 -07:00
ksm.c mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
list_lru.c mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe 2022-06-16 19:48:31 -07:00
maccess.c asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
madvise.c mm/madvise: add MADV_COLLAPSE to process_madvise() 2022-09-11 20:25:46 -07:00
Makefile mm: shrinkers: introduce debugfs interface for memory shrinkers 2022-07-03 18:08:40 -07:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock updates for v5.20 2022-08-09 09:48:30 -07:00
memcontrol.c mm: memcontrol: fix potential oom_lock recursion deadlock 2022-07-29 18:07:18 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory_hotplug.c mm: use is_zone_movable_page() helper 2022-07-29 18:07:20 -07:00
memory-failure.c mm: memory-failure: cleanup try_to_split_thp_page() 2022-09-11 20:25:48 -07:00
memory.c memory tiering: hot page selection with hint page fault latency 2022-09-11 20:25:54 -07:00
mempolicy.c mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-09-11 20:25:50 -07:00
mempool.c mm/mempool: use might_alloc() 2022-06-16 19:48:30 -07:00
memremap.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
memtest.c
migrate_device.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
migrate.c memory tiering: hot page selection with hint page fault latency 2022-09-11 20:25:54 -07:00
mincore.c mm: teach core mm about pte markers 2022-05-13 07:20:09 -07:00
mlock.c mm: handling Non-LRU pages returned by vm_normal_pages 2022-07-17 17:14:28 -07:00
mm_init.c
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmap.c mm: align larger anonymous mappings on THP boundaries 2022-09-11 20:25:47 -07:00
mmu_gather.c mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush 2022-04-28 23:16:12 -07:00
mmu_notifier.c mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() 2022-04-21 20:01:10 -07:00
mmzone.c Folio changes for 5.18 2022-03-22 17:03:12 -07:00
mprotect.c memory tiering: hot page selection with hint page fault latency 2022-09-11 20:25:54 -07:00
mremap.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
msync.c
nommu.c mm: nommu: pass a pointer to virt_to_page() 2022-07-17 17:14:37 -07:00
oom_kill.c mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery 2022-06-01 15:57:16 -07:00
page_alloc.c mm/page_alloc: only search higher order when fallback 2022-09-11 20:25:51 -07:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm/page_ext: remove unused variable in offline_page_ext 2022-09-11 20:25:47 -07:00
page_idle.c mm: don't be stuck to rmap lock on reclaim path 2022-05-19 14:08:54 -07:00
page_io.c mm/swap: remove the end_write_func argument to __swap_writepage 2022-09-11 20:25:50 -07:00
page_isolation.c mm/page_isolation.c: fix one kernel-doc comment 2022-06-16 19:11:30 -07:00
page_owner.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c Six hotfixes. One from Miaohe Lin is considered a minor thing so it isn't 2022-05-27 11:29:35 -07:00
page_vma_mapped.c mm/page_vma_mapped.c: use helper function huge_pte_lock 2022-07-17 17:14:47 -07:00
page-writeback.c writeback: avoid use-after-free after removing device 2022-08-28 14:02:43 -07:00
pagewalk.c
percpu-internal.h percpu: improve percpu_alloc_percpu event trace 2022-05-13 07:20:18 -07:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c mm: use vmalloc_array and vcalloc for array allocations 2022-03-08 09:30:46 -05:00
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free() 2022-07-17 17:14:47 -07:00
pgalloc-track.h
pgtable-generic.c mm: avoid unnecessary flush on change_huge_pmd() 2022-05-13 07:20:05 -07:00
process_vm_access.c
ptdump.c mm: sparsemem: use page table lock to protect kernel pmd operations 2022-03-22 15:57:08 -07:00
readahead.c filemap: Fix serialization adding transparent huge pages to page cache 2022-06-23 12:22:00 -04:00
rmap.c mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
rodata_test.c
secretmem.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
shmem.c shmem: update folio if shmem_replace_page() updates the page 2022-08-28 14:02:43 -07:00
shrinker_debug.c mm: shrinkers: fix double kfree on shrinker name 2022-07-29 18:07:13 -07:00
shuffle.c
shuffle.h
slab_common.c mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slab.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
slab.h mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slob.c mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slub.c kfence: add sysfs interface to disable kfence for selected slabs. 2022-09-11 20:25:52 -07:00
sparse-vmemmap.c mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c 2022-08-08 18:06:42 -07:00
sparse.c mm: memory_hotplug: enumerate all supported section flags 2022-07-03 18:08:49 -07:00
swap_cgroup.c mm: use vmalloc_array and vcalloc for array allocations 2022-03-08 09:30:46 -05:00
swap_slots.c arm64: enable THP_SWAP for arm64 2022-07-20 10:52:40 +01:00
swap_state.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
swap.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
swap.h mm/swap: remove the end_write_func argument to __swap_writepage 2022-09-11 20:25:50 -07:00
swapfile.c mm/swap: convert delete_from_swap_cache() to take a folio 2022-07-03 18:08:48 -07:00
truncate.c mm: Remove __delete_from_page_cache() 2022-06-29 08:51:05 -04:00
usercopy.c usercopy: use unsigned long instead of uintptr_t 2022-07-01 17:03:38 -07:00
userfaultfd.c mm/uffd: reset write protection when unregister with wp-mode 2022-08-20 15:17:45 -07:00
util.c mm/util.c: add warning if __vm_enough_memory fails 2022-09-11 20:25:54 -07:00
vmacache.c
vmalloc.c mm/vmalloc: extend __find_vmap_area() with one more argument 2022-07-03 18:08:41 -07:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c mm/vmscan: make the annotations of refaults code at the right place 2022-09-11 20:25:51 -07:00
vmstat.c mm: add DEVICE_ZONE to FOR_ALL_ZONES 2022-08-20 15:17:45 -07:00
workingset.c mm: shrinkers: provide shrinkers with names 2022-07-03 18:08:40 -07:00
z3fold.c mm: Convert all PageMovable users to movable_operations 2022-08-02 12:34:03 -04:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c zsmalloc: remove unnecessary size_class NULL check 2022-09-11 20:25:50 -07:00
zswap.c mm/swap: remove the end_write_func argument to __swap_writepage 2022-09-11 20:25:50 -07:00