linux/mm
Baolin Wang 43e027e414 mm: memory: extend finish_fault() to support large folio
Patch series "add mTHP support for anonymous shmem", v5.

Anonymous pages have already been supported for multi-size (mTHP)
allocation through commit 19eaf44954, that can allow THP to be
configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, the anonymous shmem will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable.  Many implement anonymous page sharing
through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
scenarios, therefore, users expect to apply an unified mTHP strategy for
anonymous pages, also including the anonymous shared pages, in order to
enjoy the benefits of mTHP.  For example, lower latency than PMD-mapped
THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM
architecture to reduce TLB miss etc.

As discussed in the bi-weekly MM meeting[1], the mTHP controls should
control all of shmem, not only anonymous shmem, but support will be added
iteratively.  Therefore, this patch set starts with support for anonymous
shmem.

The primary strategy is similar to supporting anonymous mTHP.  Introduce a
new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'.  By default all sizes will be set to "never" except PMD size,
which is set to "inherit".  This ensures backward compatibility with the
anonymous shmem enabled of the top level, meanwhile also allows
independent control of anonymous shmem enabled for each mTHP.

Use the page fault latency tool to measure the performance of 1G anonymous shmem
with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
125G memory:
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, it is observed that the patchset has a minimal impact
when mTHP is not enabled (some fluctuations observed during testing). 
When enabling 64K mTHP, there is a significant improvement of the page
fault latency.

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/


This patch (of 6):

Add large folio mapping establishment support for finish_fault() as a
preparation, to support multi-size THP allocation of anonymous shmem pages
in the following patches.

Keep the same behavior (per-page fault) for non-anon shmem to avoid
inflating the RSS unintentionally, and we can discuss what size of mapping
to build when extending mTHP to control non-anon shmem in the future.

[baolin.wang@linux.alibaba.com: avoid going beyond the PMD pagetable size]
  Link: https://lkml.kernel.org/r/b0e6a8b1-a32c-459e-ae67-fde5d28773e6@linux.alibaba.com
[baolin.wang@linux.alibaba.com: use 'PTRS_PER_PTE' instead of 'PTRS_PER_PTE - 1']
  Link: https://lkml.kernel.org/r/e1f5767a-2c9b-4e37-afe6-1de26fe54e41@linux.alibaba.com
Link: https://lkml.kernel.org/r/cover.1718090413.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:03 -07:00
..
damon mm/damon/core: fix return value from damos_wmark_metric_value 2024-05-11 15:41:36 -07:00
kasan kasan: fix bad call to unpoison_slab_object 2024-06-24 20:52:09 -07:00
kfence mm/kfence: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
kmsan kmsan: introduce test_unpoison_memory() 2024-07-03 19:30:02 -07:00
backing-dev.c writeback: support retrieving per group debug writeback stats of bdi 2024-05-05 17:53:51 -07:00
balloon_compaction.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma_debug.c
cma_sysfs.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.c mm/cma: drop incorrect alignment check in cma_init_reserved_mem 2024-04-25 20:56:42 -07:00
cma.h mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
compaction.c mm: handle profiling for fake memory allocations during compaction 2024-06-24 20:52:09 -07:00
debug_page_alloc.c mm: page_alloc: consolidate free page accounting 2024-04-25 20:56:04 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick 2024-06-15 10:43:08 -07:00
debug.c mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio() 2024-05-05 17:53:31 -07:00
dmapool_test.c mm/dmapool: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
dmapool.c mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs 2023-12-05 11:17:58 +01:00
early_ioremap.c
execmem.c mm/execmem, arch: convert remaining overrides of module_alloc to execmem 2024-05-14 00:31:43 -07:00
fadvise.c
fail_page_alloc.c
failslab.c
filemap.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
folio-compat.c mm: remove page_mapping() 2024-07-03 19:29:59 -07:00
gup_test.c
gup_test.h
gup.c mm/gup: fix hugepd handling in hugetlb rework 2024-05-07 10:37:01 -07:00
highmem.c x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs 2023-12-29 12:22:28 -08:00
hmm.c mm/treewide: replace pXd_huge() with pXd_leaf() 2024-04-25 20:55:46 -07:00
huge_memory.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
hugetlb_cgroup.c mm/hugetlb: assert hugetlb_lock in __hugetlb_cgroup_commit_charge 2024-05-05 17:53:41 -07:00
hugetlb_vmemmap.c memory: remove the now superfluous sentinel element from ctl_table array 2024-04-25 20:56:32 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c mm/hugetlb: drop node_alloc_noretry from alloc_fresh_hugetlb_folio 2024-07-03 19:29:52 -07:00
hwpoison-inject.c mm/hwpoison: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
init-mm.c mm: Deprecate pasid field 2023-12-12 10:11:32 +01:00
internal.h mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally 2024-07-03 19:30:01 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
Kconfig The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
Kconfig.debug mm/slub: unify all sl[au]b parameters with "slab_$param" 2024-01-22 10:31:08 +01:00
khugepaged.c mm: simplify thp_vma_allowable_order 2024-05-05 17:53:53 -07:00
kmemleak.c mm: lift gfp_kmemleak_mask() to gfp.h 2024-05-19 14:40:44 -07:00
ksm.c mm/ksm: fix ksm_zero_pages accounting 2024-06-05 19:19:26 -07:00
list_lru.c mm/zswap: stop lru list shrinking when encounter warm region 2024-02-22 10:24:54 -08:00
maccess.c
madvise.c mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) 2024-07-03 19:29:57 -07:00
Makefile mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c memblock: use numa_valid_node() helper to check for invalid node ID 2024-06-16 10:17:57 +03:00
memcontrol.c mm: memcontrol: remove page_memcg() 2024-07-03 19:29:59 -07:00
memfd.c mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins() 2024-03-04 17:01:21 -08:00
memory_hotplug.c mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages() 2024-07-03 19:30:02 -07:00
memory-failure.c mm/memory-failure: stop setting the folio error flag 2024-07-03 19:30:03 -07:00
memory-tiers.c memory tier: create CPUless memory tiers after obtaining HMAT info 2024-05-05 17:53:26 -07:00
memory.c mm: memory: extend finish_fault() to support large folio 2024-07-03 19:30:03 -07:00
mempolicy.c mm: mempolicy: use folio_alloc_mpol() in alloc_migration_target_by_mpol() 2024-07-03 19:29:53 -07:00
mempool.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
memremap.c mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() 2024-05-05 17:53:49 -07:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-03-13 12:12:21 -07:00
migrate_device.c mm: migrate_device: unify migrate folio for MIGRATE_SYNC_NO_COPY 2024-07-03 19:30:00 -07:00
migrate.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
mincore.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
mlock.c mm: add pmd_folio() 2024-04-25 20:56:19 -07:00
mm_init.c mm/mm_init: use node's number of cpus in deferred_page_init_max_threads 2024-07-03 19:29:58 -07:00
mm_slot.h
mmap_lock.c
mmap.c mm: batch unlink_file_vma calls in free_pgd_range 2024-07-03 19:29:58 -07:00
mmu_gather.c mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing 2024-02-22 15:27:17 -08:00
mmu_notifier.c mmu_notifier: remove the .change_pte() callback 2024-04-11 13:18:36 -04:00
mmzone.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
mprotect.c mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() 2024-07-03 19:29:56 -07:00
mremap.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
mseal.c mseal: add mseal syscall 2024-05-23 19:40:26 -07:00
msync.c
nommu.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
oom_kill.c memory: remove the now superfluous sentinel element from ctl_table array 2024-04-25 20:56:32 -07:00
page_alloc.c mm/page_alloc: Separate THP PCP into movable and non-movable categories 2024-06-24 20:52:11 -07:00
page_counter.c
page_ext.c mm: make page_ext_get() take a const argument 2024-04-25 20:56:14 -07:00
page_idle.c
page_io.c mm/swap: get the swap device offset directly 2024-07-03 19:29:55 -07:00
page_isolation.c mm: page_isolation: prepare for hygienic freelists 2024-04-25 20:56:04 -07:00
page_owner.c mm/page-owner: use gfp_nested_mask() instead of open coded masking 2024-05-19 14:40:44 -07:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILURE 2024-05-05 17:53:45 -07:00
page-writeback.c writeback: factor out balance_wb_limits to remove repeated code 2024-07-03 19:29:54 -07:00
pagewalk.c mm: pagewalk: assert write mmap lock only for walking the user page tables 2023-12-10 16:51:53 -08:00
percpu-internal.h mm: percpu: add codetag reference into pcpuobj_ext 2024-04-25 20:55:56 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c percpu: clean up all mappings when pcpu_map_pages() fails 2024-04-25 20:55:49 -07:00
percpu.c mm: percpu: enable per-cpu allocation tagging 2024-04-25 20:55:56 -07:00
pgalloc-track.h
pgtable-generic.c mm: fix race between __split_huge_pmd_locked() and GUP-fast 2024-05-07 10:37:00 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump: add check_wx_pages debugfs attribute 2024-02-22 10:24:47 -08:00
readahead.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
rmap.c mm: rmap: abstract updating per-node and per-memcg stats 2024-07-03 19:30:01 -07:00
rodata_test.c
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem_quota.c tmpfs: fix race on handling dquot rbtree 2024-03-26 11:07:23 -07:00
shmem.c mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
show_mem.c lib: add memory allocations report in show_mem() 2024-04-25 20:55:57 -07:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker.c mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() 2024-01-05 09:58:32 -08:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
slab.h The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
slub.c mm/slab: fix 'variable obj_exts set but not used' warning 2024-06-24 20:52:09 -07:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm: sparse: consistently use _nr 2024-07-03 19:30:02 -07:00
swap_cgroup.c
swap_slots.c mm: swap: update get_swap_pages() to take folio order 2024-04-25 20:56:37 -07:00
swap_state.c mm,swap: simplify VMA based swap readahead window calculation 2024-07-03 19:30:03 -07:00
swap.c mm: add kernel-doc for folio_mark_accessed() 2024-05-05 17:53:50 -07:00
swap.h mm/swap: reduce swap cache search space 2024-07-03 19:29:56 -07:00
swapfile.c mm: remove the implementation of swap_free() and always use swap_free_nr() 2024-07-03 19:30:01 -07:00
truncate.c mm/vmscan: update stale references to shrink_page_list 2024-07-03 19:29:52 -07:00
usercopy.c
userfaultfd.c mm: userfaultfd: use swap() in double_pt_lock() 2024-07-03 19:30:03 -07:00
util.c hardening fixes for v6.10-rc5 2024-06-17 12:00:22 -07:00
vmalloc.c mm/vmalloc: use __this_cpu_try_cmpxchg() in preload_this_cpu_lock() 2024-07-03 19:30:02 -07:00
vmpressure.c eventfd: simplify eventfd_signal() 2023-11-28 14:08:38 +01:00
vmscan.c mm: vmscan: reset sc->priority on retry 2024-07-03 19:29:53 -07:00
vmstat.c iommu: observability of the IOMMU allocations 2024-04-15 14:31:47 +02:00
workingset.c mm: cleanup WORKINGSET_NODES in workingset 2024-05-07 10:36:59 -07:00
z3fold.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zbud.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zpool.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zsmalloc.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
zswap.c mm: zswap: make same_filled functions folio-friendly 2024-07-03 19:30:00 -07:00