linux/mm
Zach O'Keefe 7d8faaf155 mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
a synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags.  When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages.  This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how
pages will be mapped, constructed, or faulted in the future

Return Value

If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful.  On success, process_madvise(2)
returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
is returned and errno is set to indicate the error for the most-recently
attempted hugepage collapse.  Note that many failures might have occurred,
since the operation may continue to collapse in the event a single
hugepage-sized/aligned region fails.

	ENOMEM	Memory allocation failed or VMA not found
	EBUSY	Memcg charging failed
	EAGAIN	Required resource temporarily unavailable.  Try again
		might succeed.
	EINVAL	Other error: No PMD found, subpage doesn't have Present
		bit set, "Special" page no backed by struct page, VMA
		incorrectly sized, address not page-aligned, ...

Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
to provide the caller with actionable feedback so they may take an
appropriate fallback measure.

Use Cases

An immediate user of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd.  Later, when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
coverage and dTLB performance.  TCMalloc is such an implementation that
could benefit from this[2].

Only privately-mapped anon memory is supported for now, but additional
support for file, shmem, and HugeTLB high-granularity mappings[2] is
expected.  File and tmpfs/shmem support would permit:

* Backing executable text by THPs.  Current support provided by
  CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
  might impair services from serving at their full rated load after
  (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
  immediately realize iTLB performance prevents page sharing and demand
  paging, both of which increase steady state memory footprint.  With
  MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
  and lower RAM footprints.
* Backing guest memory by hugapages after the memory contents have been
  migrated in native-page-sized chunks to a new host, in a
  userfaultfd-based live-migration stack.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

[jrdr.linux@gmail.com: avoid possible memory leak in failure path]
  Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
[zokeefe@google.com add missing kfree() to madvise_collapse()]
  Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
  Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
[zokeefe@google.com: delay computation of hpage boundaries until use]]
  Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:46 -07:00
..
damon mm/damon/dbgfs: avoid duplicate context directory creation 2022-08-28 14:02:45 -07:00
kasan - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
kfence - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
backing-dev.c writeback: avoid use-after-free after removing device 2022-08-28 14:02:43 -07:00
balloon_compaction.c mm: Convert all PageMovable users to movable_operations 2022-08-02 12:34:03 -04:00
bootmem_info.c bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem 2022-08-28 14:02:45 -07:00
cma_debug.c mm/cma_debug.c: align the name buffer length as struct cma 2022-07-29 18:07:16 -07:00
cma_sysfs.c
cma.c Revert "mm/cma.c: remove redundant cma_mutex lock" 2022-05-13 15:11:26 -07:00
cma.h mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
compaction.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
debug_page_ref.c
debug_vm_pgtable.c docs: rename Documentation/vm to Documentation/mm 2022-06-27 12:52:53 -07:00
debug.c mm: unexport page_init_poison 2022-03-24 19:06:45 -07:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c mm/early_ioremap: declare early_memremap_pgprot_adjust() 2022-03-22 15:57:11 -07:00
fadvise.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
failslab.c mm: fix missing handler for __GFP_NOWARN 2022-05-19 14:08:55 -07:00
filemap.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
folio-compat.c mm/folio-compat: Remove migration compatibility functions 2022-08-02 12:34:04 -04:00
frontswap.c docs: rename Documentation/vm to Documentation/mm 2022-06-27 12:52:53 -07:00
gup_test.c mm: rename is_pinnable_page() to is_longterm_pinnable_page() 2022-07-17 17:14:27 -07:00
gup_test.h
gup.c mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW 2022-08-20 15:17:44 -07:00
highmem.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
hmm.c mm/hmm: fault non-owner device private entries 2022-07-29 11:33:37 -07:00
huge_memory.c mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
hugetlb_cgroup.c hugetlb_cgroup: fix wrong hugetlb cgroup numa stat 2022-07-29 18:07:17 -07:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE 2022-08-08 18:06:43 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability 2022-08-08 18:06:43 -07:00
hugetlb.c mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte 2022-08-28 14:02:43 -07:00
hwpoison-inject.c mm/memory-failure: disable unpoison once hw error happens 2022-06-16 19:11:32 -07:00
init-mm.c kernel/fork: Initialize mm's PASID 2022-02-14 19:51:47 +01:00
internal.h mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: Add ioremap/iounmap_allowed() 2022-06-27 12:22:31 +01:00
Kconfig cxl for 6.0 2022-08-10 11:07:26 -07:00
Kconfig.debug Two followon fixes for the post-5.19 series "Use pageblock_order for cma 2022-05-27 11:40:49 -07:00
khugepaged.c mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse 2022-09-11 20:25:46 -07:00
kmemleak.c mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan() 2022-06-16 19:48:32 -07:00
ksm.c mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
list_lru.c mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe 2022-06-16 19:48:31 -07:00
maccess.c asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
madvise.c mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse 2022-09-11 20:25:46 -07:00
Makefile mm: shrinkers: introduce debugfs interface for memory shrinkers 2022-07-03 18:08:40 -07:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock updates for v5.20 2022-08-09 09:48:30 -07:00
memcontrol.c mm: memcontrol: fix potential oom_lock recursion deadlock 2022-07-29 18:07:18 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory_hotplug.c mm: use is_zone_movable_page() helper 2022-07-29 18:07:20 -07:00
memory-failure.c mm, hwpoison: enable memory error handling on 1GB hugepage 2022-08-08 18:06:44 -07:00
memory.c mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() 2022-09-11 20:25:45 -07:00
mempolicy.c mm/mempolicy: remove unneeded out label 2022-07-29 18:07:16 -07:00
mempool.c mm/mempool: use might_alloc() 2022-06-16 19:48:30 -07:00
memremap.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
memtest.c
migrate_device.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
migrate.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
mincore.c mm: teach core mm about pte markers 2022-05-13 07:20:09 -07:00
mlock.c mm: handling Non-LRU pages returned by vm_normal_pages 2022-07-17 17:14:28 -07:00
mm_init.c
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmap.c mm/hugetlb: fix hugetlb not supporting softdirty tracking 2022-08-20 15:17:45 -07:00
mmu_gather.c mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush 2022-04-28 23:16:12 -07:00
mmu_notifier.c mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() 2022-04-21 20:01:10 -07:00
mmzone.c Folio changes for 5.18 2022-03-22 17:03:12 -07:00
mprotect.c mm/mprotect: only reference swap pfn page if type match 2022-08-28 14:02:46 -07:00
mremap.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
msync.c
nommu.c mm: nommu: pass a pointer to virt_to_page() 2022-07-17 17:14:37 -07:00
oom_kill.c mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery 2022-06-01 15:57:16 -07:00
page_alloc.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm: use for_each_online_node and node_online instead of open coding 2022-04-29 14:36:58 -07:00
page_idle.c mm: don't be stuck to rmap lock on reclaim path 2022-05-19 14:08:54 -07:00
page_io.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
page_isolation.c mm/page_isolation.c: fix one kernel-doc comment 2022-06-16 19:11:30 -07:00
page_owner.c Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c Six hotfixes. One from Miaohe Lin is considered a minor thing so it isn't 2022-05-27 11:29:35 -07:00
page_vma_mapped.c mm/page_vma_mapped.c: use helper function huge_pte_lock 2022-07-17 17:14:47 -07:00
page-writeback.c writeback: avoid use-after-free after removing device 2022-08-28 14:02:43 -07:00
pagewalk.c
percpu-internal.h percpu: improve percpu_alloc_percpu event trace 2022-05-13 07:20:18 -07:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c mm: use vmalloc_array and vcalloc for array allocations 2022-03-08 09:30:46 -05:00
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free() 2022-07-17 17:14:47 -07:00
pgalloc-track.h
pgtable-generic.c mm: avoid unnecessary flush on change_huge_pmd() 2022-05-13 07:20:05 -07:00
process_vm_access.c
ptdump.c mm: sparsemem: use page table lock to protect kernel pmd operations 2022-03-22 15:57:08 -07:00
readahead.c filemap: Fix serialization adding transparent huge pages to page cache 2022-06-23 12:22:00 -04:00
rmap.c mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
rodata_test.c
secretmem.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
shmem.c shmem: update folio if shmem_replace_page() updates the page 2022-08-28 14:02:43 -07:00
shrinker_debug.c mm: shrinkers: fix double kfree on shrinker name 2022-07-29 18:07:13 -07:00
shuffle.c
shuffle.h
slab_common.c mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slab.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
slab.h mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slob.c mm/slab_common: move generic bulk alloc/free functions to SLOB 2022-07-20 13:30:12 +02:00
slub.c mm/sl[au]b: use own bulk free function when bulk alloc failed 2022-07-20 13:30:11 +02:00
sparse-vmemmap.c mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c 2022-08-08 18:06:42 -07:00
sparse.c mm: memory_hotplug: enumerate all supported section flags 2022-07-03 18:08:49 -07:00
swap_cgroup.c mm: use vmalloc_array and vcalloc for array allocations 2022-03-08 09:30:46 -05:00
swap_slots.c arm64: enable THP_SWAP for arm64 2022-07-20 10:52:40 +01:00
swap_state.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
swap.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
swap.h mm/khugepaged: try to free transhuge swapcache when possible 2022-07-03 18:08:52 -07:00
swapfile.c mm/swap: convert delete_from_swap_cache() to take a folio 2022-07-03 18:08:48 -07:00
truncate.c mm: Remove __delete_from_page_cache() 2022-06-29 08:51:05 -04:00
usercopy.c usercopy: use unsigned long instead of uintptr_t 2022-07-01 17:03:38 -07:00
userfaultfd.c mm/uffd: reset write protection when unregister with wp-mode 2022-08-20 15:17:45 -07:00
util.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
vmacache.c
vmalloc.c mm/vmalloc: extend __find_vmap_area() with one more argument 2022-07-03 18:08:41 -07:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
vmstat.c mm: add DEVICE_ZONE to FOR_ALL_ZONES 2022-08-20 15:17:45 -07:00
workingset.c mm: shrinkers: provide shrinkers with names 2022-07-03 18:08:40 -07:00
z3fold.c mm: Convert all PageMovable users to movable_operations 2022-08-02 12:34:03 -04:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c mm/zsmalloc: do not attempt to free IS_ERR handle 2022-08-28 14:02:44 -07:00
zswap.c zswap: memcg accounting 2022-05-19 14:08:53 -07:00