korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-11 12:28:41 +08:00

Yang Shi c6a7f445a2 mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA

Patch series "mm: userspace hugepage collapse", v7.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of eligible ranges of memory into transparent hugepages in process context, thus permitting users to more tightly control their own hugepage utilization policy at their own expense.

This idea was introduced by David Rientjes[5].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and leverages the new process_madvise(2) call.

process_madvise(2)

    Performs a synchronous collapse of the native pages mapped by the list of iovecs into transparent hugepages.

    This operation is independent of the system THP sysfs settings, but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

    THP allocation may enter direct reclaim and/or compaction.

    When a range spans multiple VMAs, the semantics of the collapse over each VMA are independent of the others.

    Caller must have CAP_SYS_ADMIN if not acting on self.

    Return value follows existing process_madvise(2) conventions. A “success” indicates that all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed, or were already pmd-mapped THPs.

madvise(2)

    Equivalent to process_madvise(2) on self, with 0 returned on “success”.

Current Use-Cases
--------------------------------

(1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevent page sharing and demand paging, both of which increase steady state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: peak upfront performance and lower RAM footprints. Note that subsequent support for file-backed memory is required here.

(2) malloc() implementations that manage memory in hugepage-sized chunks, but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED, zapping the pmd. Later, when the memory is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage coverage and dTLB performance. TCMalloc is such an implementation that could benefit from this[6]. A prior study of Google internal workloads during evaluation of Temeraire, a hugepage-aware enhancement to TCMalloc, showed that nearly 20% of all cpu cycles were spent in dTLB stalls, and that increasing hugepage coverage by even a small amount can help with that[7].

(3) userfaultfd-based live migration of virtual machines satisfies UFFD faults by fetching native-sized pages over the network (to avoid the latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. Note that subsequent support for file/shmem-backed memory is required here.

(4) HugeTLB high-granularity mapping allows a HugeTLB page to be mapped at different levels in the page tables[8]. As it's not "transparent" like THP, HugeTLB high-granularity mappings require an explicit user API. It is intended that MADV_COLLAPSE be co-opted for this use case[9]. Note that subsequent support for HugeTLB memory is required here.

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and shmem memory support will be added later.

One possible user of this functionality is a userspace agent that attempts to optimize THP utilization system-wide by allocating THPs based on, for example, task priority, task performance requirements, or heatmaps. For the latter, one idea that has already surfaced is using DAMON to identify hot regions, and driving THP collapse through a new DAMOS_COLLAPSE scheme[10].

This patch (of 17):

khugepaged has an optimization that reduces huge page allocation calls for !CONFIG_NUMA by carrying a huge page that was allocated but failed to collapse over to the next loop. CONFIG_NUMA doesn't do so, since the next loop may try to collapse a huge page from a different node, so it doesn't make much sense to carry it.

But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page() before scanning the address space, which means a huge page may be allocated even though there is no suitable range for collapsing. The page would then just be freed if khugepaged had already made enough progress. This could make a NUMA=n run have 5 times as much thp_collapse_alloc as a NUMA=y run. This makes things worse, because the many additional pointless THP allocations defeat the purpose of the optimization.

This could be fixed by carrying the huge page across scans, but that would complicate the code further and the huge page might be carried indefinitely. But if we take one step back, the optimization itself seems not worth keeping nowadays since:

* Not too many users build a NUMA=n kernel nowadays, even when the kernel is actually running on a non-NUMA machine. Some small devices may run a NUMA=n kernel, but I don't think they actually use THP.

* Since commit `44042b4498` ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists"), THP could be cached by pcp. This largely does the job the optimization was meant to do.

Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@google.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Co-developed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-09-11 20:25:44 -07:00
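The madvise(2) side of the interface described in the commit message above could be exercised from userspace roughly as follows. This is a minimal sketch, not code from the series: `hpage_covered_range()` and `try_collapse()` are hypothetical helper names, the 2 MiB hugepage size is an assumption that holds for x86-64 with 4 KiB base pages, and the fallback value 25 for `MADV_COLLAPSE` is the asm-generic definition (some architectures may use a different value). The first helper computes the "hugepage-sized/aligned regions covered by the provided range"; the second passes that sub-range to `madvise()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Older libc headers may not define MADV_COLLAPSE; 25 is the asm-generic value. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: PMD-sized hugepages are 2 MiB (x86-64, 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: shrink [start, start + len) to the hugepage-aligned
 * sub-range that it fully covers. Stores the aligned start in *aligned_start
 * and returns the sub-range length, or 0 if no whole hugepage fits.
 */
static size_t hpage_covered_range(uintptr_t start, size_t len,
                                  uintptr_t *aligned_start)
{
    assert(aligned_start != NULL);

    /* Round the start up and the end down to hugepage boundaries. */
    uintptr_t first = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    uintptr_t last = (start + len) & ~(HPAGE_SIZE - 1);

    *aligned_start = first;
    return last > first ? (size_t)(last - first) : 0;
}

/*
 * Hypothetical helper: request a synchronous collapse of the covered
 * hugepage-sized regions in the calling process. Returns 0 on success or a
 * negative errno (for example -EINVAL on kernels without MADV_COLLAPSE).
 */
static int try_collapse(void *addr, size_t len)
{
    uintptr_t first;
    size_t covered = hpage_covered_range((uintptr_t)addr, len, &first);

    if (covered == 0)
        return -EINVAL; /* nothing hugepage-sized to collapse */

    if (madvise((void *)first, covered, MADV_COLLAPSE) != 0)
        return -errno;

    return 0;
}
```

A caller such as the malloc() implementation in use case (2) might invoke `try_collapse()` on a region once it becomes hot again, and treat a negative return as "leave the region backed by native pages".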
..
damon	mm/damon/dbgfs: avoid duplicate context directory creation	2022-08-28 14:02:45 -07:00
kasan	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
kfence	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
backing-dev.c	writeback: avoid use-after-free after removing device	2022-08-28 14:02:43 -07:00
balloon_compaction.c	mm: Convert all PageMovable users to movable_operations	2022-08-02 12:34:03 -04:00
bootmem_info.c	bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem	2022-08-28 14:02:45 -07:00
cma_debug.c	mm/cma_debug.c: align the name buffer length as struct cma	2022-07-29 18:07:16 -07:00
cma_sysfs.c
cma.c	Revert "mm/cma.c: remove redundant cma_mutex lock"	2022-05-13 15:11:26 -07:00
cma.h	mm/cma: provide option to opt out from exposing pages on activation failure	2022-03-22 15:57:09 -07:00
compaction.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
debug_page_ref.c
debug_vm_pgtable.c	docs: rename Documentation/vm to Documentation/mm	2022-06-27 12:52:53 -07:00
debug.c	mm: unexport page_init_poison	2022-03-24 19:06:45 -07:00
dmapool.c	mm/dmapool.c: revert "make dma pool to use kmalloc_node"	2022-01-15 16:30:28 +02:00
early_ioremap.c	mm/early_ioremap: declare early_memremap_pgprot_adjust()	2022-03-22 15:57:11 -07:00
fadvise.c	riscv: compat: syscall: Add compat_sys_call_table implementation	2022-04-26 13:36:25 -07:00
failslab.c	mm: fix missing handler for __GFP_NOWARN	2022-05-19 14:08:55 -07:00
filemap.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
folio-compat.c	mm/folio-compat: Remove migration compatibility functions	2022-08-02 12:34:04 -04:00
frontswap.c	docs: rename Documentation/vm to Documentation/mm	2022-06-27 12:52:53 -07:00
gup_test.c	mm: rename is_pinnable_page() to is_longterm_pinnable_page()	2022-07-17 17:14:27 -07:00
gup_test.h
gup.c	mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW	2022-08-20 15:17:44 -07:00
highmem.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
hmm.c	mm/hmm: fault non-owner device private entries	2022-07-29 11:33:37 -07:00
huge_memory.c	mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW	2022-08-20 15:17:44 -07:00
hugetlb_cgroup.c	hugetlb_cgroup: fix wrong hugetlb cgroup numa stat	2022-07-29 18:07:17 -07:00
hugetlb_vmemmap.c	mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE	2022-08-08 18:06:43 -07:00
hugetlb_vmemmap.h	mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability	2022-08-08 18:06:43 -07:00
hugetlb.c	mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte	2022-08-28 14:02:43 -07:00
hwpoison-inject.c	mm/memory-failure: disable unpoison once hw error happens	2022-06-16 19:11:32 -07:00
init-mm.c	kernel/fork: Initialize mm's PASID	2022-02-14 19:51:47 +01:00
internal.h	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
interval_tree.c
io-mapping.c
ioremap.c	mm: ioremap: Add ioremap/iounmap_allowed()	2022-06-27 12:22:31 +01:00
Kconfig	cxl for 6.0	2022-08-10 11:07:26 -07:00
Kconfig.debug	Two followon fixes for the post-5.19 series "Use pageblock_order for cma	2022-05-27 11:40:49 -07:00
khugepaged.c	mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA	2022-09-11 20:25:44 -07:00
kmemleak.c	mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()	2022-06-16 19:48:32 -07:00
ksm.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
list_lru.c	mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe	2022-06-16 19:48:31 -07:00
maccess.c	asm-generic updates for 5.18	2022-03-23 18:03:08 -07:00
madvise.c	mm: handling Non-LRU pages returned by vm_normal_pages	2022-07-17 17:14:28 -07:00
Makefile	mm: shrinkers: introduce debugfs interface for memory shrinkers	2022-07-03 18:08:40 -07:00
mapping_dirty_helpers.c	mm: move tlb_flush_pending inline helpers to mm_inline.h	2022-01-15 16:30:27 +02:00
memblock.c	memblock updates for v5.20	2022-08-09 09:48:30 -07:00
memcontrol.c	mm: memcontrol: fix potential oom_lock recursion deadlock	2022-07-29 18:07:18 -07:00
memfd.c	memfd: fix F_SEAL_WRITE after shmem huge page allocated	2022-03-05 11:08:32 -08:00
memory_hotplug.c	mm: use is_zone_movable_page() helper	2022-07-29 18:07:20 -07:00
memory-failure.c	mm, hwpoison: enable memory error handling on 1GB hugepage	2022-08-08 18:06:44 -07:00
memory.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
mempolicy.c	mm/mempolicy: remove unneeded out label	2022-07-29 18:07:16 -07:00
mempool.c	mm/mempool: use might_alloc()	2022-06-16 19:48:30 -07:00
memremap.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
memtest.c
migrate_device.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
migrate.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
mincore.c	mm: teach core mm about pte markers	2022-05-13 07:20:09 -07:00
mlock.c	mm: handling Non-LRU pages returned by vm_normal_pages	2022-07-17 17:14:28 -07:00
mm_init.c
mmap_lock.c	mm: mmap_lock: fix disabling preemption directly	2021-07-23 17:43:28 -07:00
mmap.c	mm/hugetlb: fix hugetlb not supporting softdirty tracking	2022-08-20 15:17:45 -07:00
mmu_gather.c	mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush	2022-04-28 23:16:12 -07:00
mmu_notifier.c	mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()	2022-04-21 20:01:10 -07:00
mmzone.c	Folio changes for 5.18	2022-03-22 17:03:12 -07:00
mprotect.c	mm/mprotect: only reference swap pfn page if type match	2022-08-28 14:02:46 -07:00
mremap.c	Yang Shi has improved the behaviour of khugepaged collapsing of readonly	2022-05-26 12:32:41 -07:00
msync.c
nommu.c	mm: nommu: pass a pointer to virt_to_page()	2022-07-17 17:14:37 -07:00
oom_kill.c	mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery	2022-06-01 15:57:16 -07:00
page_alloc.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
page_counter.c	mm/page_counter: remove an incorrect call to propagate_protected_usage()	2022-01-15 16:30:27 +02:00
page_ext.c	mm: use for_each_online_node and node_online instead of open coding	2022-04-29 14:36:58 -07:00
page_idle.c	mm: don't be stuck to rmap lock on reclaim path	2022-05-19 14:08:54 -07:00
page_io.c	Yang Shi has improved the behaviour of khugepaged collapsing of readonly	2022-05-26 12:32:41 -07:00
page_isolation.c	mm/page_isolation.c: fix one kernel-doc comment	2022-06-16 19:11:30 -07:00
page_owner.c	Yang Shi has improved the behaviour of khugepaged collapsing of readonly	2022-05-26 12:32:41 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c	Six hotfixes. One from Miaohe Lin is considered a minor thing so it isn't	2022-05-27 11:29:35 -07:00
page_vma_mapped.c	mm/page_vma_mapped.c: use helper function huge_pte_lock	2022-07-17 17:14:47 -07:00
page-writeback.c	writeback: avoid use-after-free after removing device	2022-08-28 14:02:43 -07:00
pagewalk.c
percpu-internal.h	percpu: improve percpu_alloc_percpu event trace	2022-05-13 07:20:18 -07:00
percpu-km.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu-stats.c	mm: use vmalloc_array and vcalloc for array allocations	2022-03-08 09:30:46 -05:00
percpu-vm.c	percpu: flush tlb in pcpu_reclaim_populated()	2021-07-04 18:30:17 +00:00
percpu.c	mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()	2022-07-17 17:14:47 -07:00
pgalloc-track.h
pgtable-generic.c	mm: avoid unnecessary flush on change_huge_pmd()	2022-05-13 07:20:05 -07:00
process_vm_access.c
ptdump.c	mm: sparsemem: use page table lock to protect kernel pmd operations	2022-03-22 15:57:08 -07:00
readahead.c	filemap: Fix serialization adding transparent huge pages to page cache	2022-06-23 12:22:00 -04:00
rmap.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
rodata_test.c
secretmem.c	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
shmem.c	shmem: update folio if shmem_replace_page() updates the page	2022-08-28 14:02:43 -07:00
shrinker_debug.c	mm: shrinkers: fix double kfree on shrinker name	2022-07-29 18:07:13 -07:00
shuffle.c
shuffle.h
slab_common.c	mm/slab_common: move generic bulk alloc/free functions to SLOB	2022-07-20 13:30:12 +02:00
slab.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
slab.h	mm/slab_common: move generic bulk alloc/free functions to SLOB	2022-07-20 13:30:12 +02:00
slob.c	mm/slab_common: move generic bulk alloc/free functions to SLOB	2022-07-20 13:30:12 +02:00
slub.c	mm/sl[au]b: use own bulk free function when bulk alloc failed	2022-07-20 13:30:11 +02:00
sparse-vmemmap.c	mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c	2022-08-08 18:06:42 -07:00
sparse.c	mm: memory_hotplug: enumerate all supported section flags	2022-07-03 18:08:49 -07:00
swap_cgroup.c	mm: use vmalloc_array and vcalloc for array allocations	2022-03-08 09:30:46 -05:00
swap_slots.c	arm64: enable THP_SWAP for arm64	2022-07-20 10:52:40 +01:00
swap_state.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
swap.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
swap.h	mm/khugepaged: try to free transhuge swapcache when possible	2022-07-03 18:08:52 -07:00
swapfile.c	mm/swap: convert delete_from_swap_cache() to take a folio	2022-07-03 18:08:48 -07:00
truncate.c	mm: Remove __delete_from_page_cache()	2022-06-29 08:51:05 -04:00
usercopy.c	usercopy: use unsigned long instead of uintptr_t	2022-07-01 17:03:38 -07:00
userfaultfd.c	mm/uffd: reset write protection when unregister with wp-mode	2022-08-20 15:17:45 -07:00
util.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
vmacache.c
vmalloc.c	mm/vmalloc: extend __find_vmap_area() with one more argument	2022-07-03 18:08:41 -07:00
vmpressure.c	mm/vmpressure: fix data-race with memcg->socket_pressure	2021-11-06 13:30:40 -07:00
vmscan.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
vmstat.c	mm: add DEVICE_ZONE to FOR_ALL_ZONES	2022-08-20 15:17:45 -07:00
workingset.c	mm: shrinkers: provide shrinkers with names	2022-07-03 18:08:40 -07:00
z3fold.c	mm: Convert all PageMovable users to movable_operations	2022-08-02 12:34:03 -04:00
zbud.c	mm/zbud: add kerneldoc fields for zbud_pool	2021-07-01 11:06:03 -07:00
zpool.c	zpool: remove the list of pools_head	2022-01-15 16:30:31 +02:00
zsmalloc.c	mm/zsmalloc: do not attempt to free IS_ERR handle	2022-08-28 14:02:44 -07:00
zswap.c	zswap: memcg accounting	2022-05-19 14:08:53 -07:00