linux/mm
Hugh Dickins 1d65b771bc mm/khugepaged: retract_page_tables() without mmap or vma lock
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail: so
retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be freed
immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement does
raise some more questions: leave it until a later release.

Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:24 -07:00
..
damon mm/damon/core-test: initialise context before test in damon_test_set_attrs() 2023-07-27 13:07:03 -07:00
kasan kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
kfence mm/slab: introduce kmem_cache flag SLAB_NO_MERGE 2023-06-02 10:24:33 +02:00
kmsan kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan 2023-06-23 16:59:26 -07:00
backing-dev.c mm: backing-dev: make bdi_class a static const structure 2023-06-23 16:59:27 -07:00
balloon_compaction.c
bootmem_info.c
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
cma.c mm: cma: print cma name as well in cma_alloc debug 2023-08-18 10:12:12 -07:00
cma.h
compaction.c mm: compaction: skip the memory hole rapidly when isolating free pages 2023-08-18 10:12:14 -07:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable,page_table_check: warn pte map fails 2023-06-19 16:19:15 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c
filemap.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
folio-compat.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
frontswap.c mm: zswap: support exclusive loads 2023-06-19 16:19:05 -07:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
gup.c mm/gup: retire follow_hugetlb_page() 2023-08-18 10:12:04 -07:00
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
huge_memory.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
hugetlb_cgroup.c mm/hugetlb: increase use of folios in alloc_huge_page() 2023-02-13 15:54:27 -08:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: fix a race between vmemmap pmd split 2023-08-18 10:12:14 -07:00
hugetlb_vmemmap.h
hugetlb.c mm: userfaultfd: support UFFDIO_POISON for hugetlbfs 2023-08-18 10:12:17 -07:00
hwpoison-inject.c
init-mm.c IOMMU Updates for Linux 6.4 2023-04-30 13:00:38 -07:00
internal.h mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: make MEMFD_CREATE into a selectable config option 2023-08-18 10:12:01 -07:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
khugepaged.c mm/khugepaged: retract_page_tables() without mmap or vma lock 2023-08-18 10:12:24 -07:00
kmemleak.c lib/stackdepot, mm: rename stack_depot_want_early_init 2023-02-16 20:43:49 -08:00
ksm.c ksm: consider KSM-placed zeropages when calculating KSM profit 2023-08-18 10:12:10 -07:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm: make PTE_MARKER_SWAPIN_ERROR more general 2023-08-18 10:12:16 -07:00
Makefile mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
mapping_dirty_helpers.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memblock.c Revert "mm,memblock: reset memblock.reserved to system init state to prevent UAF" 2023-07-28 09:47:06 -07:00
memcontrol.c mm/memcg: minor cleanup for MEM_CGROUP_ID_MAX 2023-08-18 10:12:15 -07:00
memfd.c mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2023-08-18 10:12:11 -07:00
memory_hotplug.c mm/memory_hotplug: document the signal_pending() check in offline_pages() 2023-08-18 10:12:19 -07:00
memory-failure.c mm: memory-failure: fix race window when trying to get hugetlb folio 2023-08-18 10:12:20 -07:00
memory-tiers.c memory tier: rename destroy_memory_type() to put_memory_type() 2023-08-18 10:12:11 -07:00
memory.c mm/memory: pass folio into do_page_mkwrite() 2023-08-18 10:12:21 -07:00
mempolicy.c mm/mempolicy: Take VMA lock before replacing policy 2023-07-28 09:44:06 -07:00
mempool.c mempool: do not use ksize() for poisoning 2022-11-30 15:58:41 -08:00
memremap.c mm/memremap.c: fix outdated comment in devm_memremap_pages 2023-02-09 16:51:46 -08:00
memtest.c mm/memtest: add results of early memtest to /proc/meminfo 2023-04-05 19:42:55 -07:00
migrate_device.c mm/migrate_device: try to handle swapcache pages 2023-08-18 10:12:09 -07:00
migrate.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
mincore.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mlock.c mm/mlock: fix vma iterator conversion of apply_vma_lock_flags() 2023-07-17 12:53:21 -07:00
mm_init.c mm/mm_init.c: mark check_for_memory() as __init 2023-08-18 10:12:18 -07:00
mm_slot.h
mmap_lock.c
mmap.c mm: lock VMA in dup_anon_vma() before setting ->anon_vma 2023-07-27 13:07:04 -07:00
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
mmzone.c
mprotect.c mm: make PTE_MARKER_SWAPIN_ERROR more general 2023-08-18 10:12:16 -07:00
mremap.c mm: Update do_vmi_align_munmap() return semantics 2023-07-01 08:10:56 -07:00
msync.c
nommu.c mm: make show_free_areas() static 2023-08-18 10:12:02 -07:00
oom_kill.c mm, oom: do not check 0 mask in out_of_memory() 2023-06-09 16:25:20 -07:00
page_alloc.c mm: page_alloc: avoid false page outside zone error info 2023-08-18 10:12:10 -07:00
page_counter.c
page_ext.c mm/page_ext: init page_ext early if there are no deferred struct pages 2023-02-02 22:33:22 -08:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c swap: use __bio_add_page to add page to bio 2023-05-31 09:50:02 -06:00
page_isolation.c mm: page_isolation: write proper kerneldoc 2023-06-19 16:18:59 -07:00
page_owner.c mm/page_owner/cma: show pfn in cma/page_owner with hex format 2023-06-19 16:19:32 -07:00
page_poison.c
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
page_vma_mapped.c mm: correct stale comment of function check_pte 2023-08-18 10:12:13 -07:00
page-writeback.c writeback: account the number of pages written back 2023-07-08 09:29:30 -07:00
pagewalk.c mm/pagewalk: fix EFI_PGT_DUMP of espfix area 2023-07-27 13:07:04 -07:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: memcontrol: rename memcg_kmem_enabled() 2023-02-16 20:43:56 -08:00
pgalloc-track.h
pgtable-generic.c mm/pgtable: add pte_free_defer() for pgtable as page 2023-08-18 10:12:24 -07:00
process_vm_access.c mm/gup: remove unused vmas parameter from pin_user_pages_remote() 2023-06-09 16:25:25 -07:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
rmap.c rmap: pass the folio to __page_check_anon_rmap() 2023-08-18 10:12:12 -07:00
rodata_test.c
secretmem.c mm/mlock: rename mlock_future_check() to mlock_future_ok() 2023-06-09 16:25:38 -07:00
shmem.c mm: make PTE_MARKER_SWAPIN_ERROR more general 2023-08-18 10:12:16 -07:00
show_mem.c mm: make show_free_areas() static 2023-08-18 10:12:02 -07:00
shrinker_debug.c Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless" 2023-06-19 13:19:34 -07:00
shuffle.c
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab_common.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slab.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slab.h kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
slub.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
sparse-vmemmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
sparse.c mm/sparse: remove redundant judgments from macro for_each_present_section_nr 2023-08-18 10:12:14 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h mm: remove the __swap_writepage return value 2023-02-02 22:33:33 -08:00
swapfile.c mm: make PTE_MARKER_SWAPIN_ERROR more general 2023-08-18 10:12:16 -07:00
truncate.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: userfaultfd: support UFFDIO_POISON for hugetlbfs 2023-08-18 10:12:17 -07:00
util.c mm: remove page_rmapping() 2023-08-18 10:12:01 -07:00
vmalloc.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
vmpressure.c
vmscan.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
vmstat.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
workingset.c Multi-gen LRU: fix workingset accounting 2023-06-09 16:25:46 -07:00
z3fold.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c zsmalloc: remove obj_tagged() 2023-08-18 10:12:18 -07:00
zswap.c mm: zswap: fix double invalidate with exclusive loads 2023-06-23 16:59:31 -07:00