linux/mm
Zach O'Keefe edb5d0cf55 mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
In commit 34488399fa ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():

	-       if (!pmd_present(pmde))
	-               return SCAN_PMD_NULL;
	+       if (pmd_none(pmde))
	+               return SCAN_PMD_NONE;

This was for-use by MADV_COLLAPSE file/shmem codepaths, where
MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
khugepaged race-in, free the pte table, and clear the pmd.  Such codepaths
include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc). 
Regardless, the code is prepared install a huge-pmd only when the existing
pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.

However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds.  One such example that could leak through are swap
entries.  The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
treatments by various arch.

The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present() and
pmd_trans_huge() still need to return true while the pmd is in this
transitory state.  For example, x86's pmd_present() also checks the
_PAGE_PSE , riscv's version also checks the _PAGE_LEAF bit, and arm64 also
checks a PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page().  The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd.  So, what is the risk of this
      danger?

      The only relevant path is:

	madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed.  This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap.  This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
  Not possible.

4) !pmd_present() && !pmd_trans_huge()
  Neither PRESENT nor PROTNONE set
  => not present

I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, longarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since
!pmd_present() always takes failure path).

Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how mmap_lock
is held (if at all).

Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com
Fixes: 34488399fa ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31 16:44:09 -08:00
..
damon mm/damon: use kstrtobool() instead of strtobool() 2022-11-30 15:58:45 -08:00
kasan kasan: mark kasan_kunit_executing as static 2023-01-11 16:14:21 -08:00
kfence hardening updates for v6.2-rc1 2022-12-14 12:20:00 -08:00
kmsan kmsan: export kmsan_handle_urb 2022-12-21 14:31:52 -08:00
backing-dev.c mm: add /sys/class/bdi/<bdi>/min_ratio_fine knob 2022-11-30 15:59:06 -08:00
balloon_compaction.c mm: Convert all PageMovable users to movable_operations 2022-08-02 12:34:03 -04:00
bootmem_info.c bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem 2022-08-28 14:02:45 -07:00
cma_debug.c mm/cma_debug: show complete cma name in debugfs directories 2022-09-11 20:25:50 -07:00
cma_sysfs.c
cma.c Revert "mm/cma.c: remove redundant cma_mutex lock" 2022-05-13 15:11:26 -07:00
cma.h mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
compaction.c mm, compaction: fix fast_isolate_around() to stay within boundaries 2022-11-30 15:59:07 -08:00
debug_page_ref.c
debug_vm_pgtable.c mm: remove unused savedwrite infrastructure 2022-11-30 15:58:49 -08:00
debug.c mm,thp,rmap: simplify compound page mapcount handling 2022-11-30 15:58:46 -08:00
dmapool.c
early_ioremap.c mm/early_ioremap: declare early_memremap_pgprot_adjust() 2022-03-22 15:57:11 -07:00
fadvise.c mm/fadvise: use LLONG_MAX instead of -1 for eof 2022-12-11 18:12:11 -08:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c filemap: convert replace_page_cache_page() to replace_page_cache_folio() 2022-12-11 18:12:12 -08:00
folio-compat.c folio-compat: remove lru_cache_add() 2022-12-11 18:12:13 -08:00
frontswap.c frontswap: don't call ->init if no ops are registered 2022-09-26 12:14:34 -07:00
gup_test.c mm/gup_test: free memory allocated via kvcalloc() using kvfree() 2022-12-15 16:37:48 -08:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
gup.c New Feature: 2022-12-17 14:06:53 -06:00
highmem.c highmem: fix kmap_to_page() for kmap_local_page() addresses 2022-10-12 18:51:51 -07:00
hmm.c mm: Remove pointless barrier() after pmdp_get_lockless() 2022-12-15 10:37:27 -08:00
huge_memory.c ARM64: 2022-12-15 11:12:21 -08:00
hugetlb_cgroup.c mm/hugeltb_cgroup: convert hugetlb_cgroup_commit_charge*() to folios 2022-11-30 15:58:43 -08:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: remap head page to newly allocated page 2022-11-30 15:58:47 -08:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability 2022-08-08 18:06:43 -07:00
hugetlb.c mm: fix a few rare cases of using swapin error pte marker 2023-01-18 17:02:19 -08:00
hwpoison-inject.c mm/hwpoison: add __init/__exit annotations to module init/exit funcs 2022-10-03 14:03:05 -07:00
init-mm.c mm: remove rb tree. 2022-09-26 19:46:16 -07:00
internal.h mm/hwpoison: introduce per-memory_block hwpoison counter 2022-11-08 17:37:22 -08:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: Add ioremap/iounmap_allowed() 2022-06-27 12:22:31 +01:00
Kconfig New Feature: 2022-12-17 14:06:53 -06:00
Kconfig.debug mm, slub: add CONFIG_SLUB_TINY 2022-11-27 23:38:02 +01:00
khugepaged.c mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups 2023-01-31 16:44:09 -08:00
kmemleak.c mm/kmemleak: use %pK to display kernel pointers in backtrace 2022-12-15 16:37:49 -08:00
ksm.c mm/ksm: convert break_ksm() to use walk_page_range_vma() 2022-12-11 18:12:09 -08:00
list_lru.c mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe 2022-06-16 19:48:31 -07:00
maccess.c maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() 2022-11-11 11:44:46 -08:00
madvise.c mm: update mmap_sem comments to refer to mmap_lock 2023-01-11 16:14:22 -08:00
Makefile mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol 2022-10-03 14:03:36 -07:00
mapping_dirty_helpers.c mm: Rename pmd_read_atomic() 2022-12-15 10:37:27 -08:00
memblock.c mm: Always release pages to the buddy allocator in memblock_free_late(). 2023-01-08 18:49:33 +02:00
memcontrol.c Revert "mm: add nodes= arg to memory.reclaim" 2023-01-31 16:44:07 -08:00
memfd.c
memory_hotplug.c mm: add pageblock_aligned() macro 2022-10-03 14:03:04 -07:00
memory-failure.c mm/memory-failure.c: cleanup in unpoison_memory 2022-11-30 15:59:08 -08:00
memory-tiers.c mm/demotion: fix NULL vs IS_ERR checking in memory_tier_init 2022-11-30 15:58:54 -08:00
memory.c mm: fix a few rare cases of using swapin error pte marker 2023-01-18 17:02:19 -08:00
mempolicy.c mm/mempolicy: fix memory leak in set_mempolicy_home_node system call 2022-12-21 14:31:51 -08:00
mempool.c mempool: do not use ksize() for poisoning 2022-11-30 15:58:41 -08:00
memremap.c mm/memremap.c: map FS_DAX device memory as decrypted 2022-11-08 15:57:23 -08:00
memtest.c
migrate_device.c mm/migrate_device: return number of migrating pages in args->cpages 2022-11-22 18:50:43 -08:00
migrate.c MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
mincore.c mm: convert find_get_incore_page() to filemap_get_incore_folio() 2022-11-08 17:37:18 -08:00
mlock.c mm/mlock: drop dead code in count_mm_mlocked_page_nr() 2022-09-26 19:46:27 -07:00
mm_init.c memory: move hotplug memory notifier priority to same file for easy sorting 2022-11-08 17:37:17 -08:00
mm_slot.h mm: introduce common struct mm_slot 2022-10-03 14:02:43 -07:00
mmap_lock.c
mmap.c mm: update mmap_sem comments to refer to mmap_lock 2023-01-11 16:14:22 -08:00
mmu_gather.c mm: mmu_gather: allow more than one batch of delayed rmaps 2022-12-11 18:12:21 -08:00
mmu_notifier.c mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() 2022-04-21 20:01:10 -07:00
mmzone.c mm: multi-gen LRU: groundwork 2022-09-26 19:46:09 -07:00
mprotect.c mm: fix a few rare cases of using swapin error pte marker 2023-01-18 17:02:19 -08:00
mremap.c mm, mremap: fix mremap() expanding for vma's with vm_ops->close() 2023-01-31 16:44:08 -08:00
msync.c mm/msync: use vma_find() instead of vma linked list 2022-09-26 19:46:25 -07:00
nommu.c nommu: fix split_vma() map_count error 2023-01-11 16:14:23 -08:00
oom_kill.c mm: reduce noise in show_mem for lowmem allocations 2022-09-26 19:46:29 -07:00
page_alloc.c mm/page_alloc: update comments in __free_pages_ok() 2022-12-11 18:12:17 -08:00
page_counter.c mm: page_counter: remove unneeded atomic ops for low/min 2022-09-11 20:26:01 -07:00
page_ext.c Merge branch 'mm-hotfixes-stable' into mm-stable 2022-11-30 14:58:42 -08:00
page_idle.c mm: don't be stuck to rmap lock on reclaim path 2022-05-19 14:08:54 -07:00
page_io.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
page_isolation.c mm/page_isolation: fix clang deadcode warning 2022-10-28 13:37:22 -07:00
page_owner.c mm: reuse pageblock_start/end_pfn() macro 2022-10-03 14:03:03 -07:00
page_poison.c
page_reporting.c mm/page_reporting: Add checks for page_reporting_order param 2022-11-28 16:48:19 +00:00
page_reporting.h
page_table_check.c mm: use kstrtobool() instead of strtobool() 2022-11-30 15:58:45 -08:00
page_vma_mapped.c mm/swap: add swp_offset_pfn() to fetch PFN from swap entry 2022-09-26 19:46:05 -07:00
page-writeback.c mm: add bdi_set_min_ratio_no_scale() function 2022-11-30 15:59:06 -08:00
pagewalk.c mm/pagewalk: add walk_page_range_vma() 2022-12-11 18:12:08 -08:00
percpu-internal.h percpu: improve percpu_alloc_percpu event trace 2022-05-13 07:20:18 -07:00
percpu-km.c
percpu-stats.c mm: use vmalloc_array and vcalloc for array allocations 2022-03-08 09:30:46 -05:00
percpu-vm.c
percpu.c mm/percpu.c: remove the lcm code since block size is fixed at page size 2022-11-07 22:59:18 -08:00
pgalloc-track.h
pgtable-generic.c mm: avoid unnecessary flush on change_huge_pmd() 2022-05-13 07:20:05 -07:00
process_vm_access.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
ptdump.c mm: pagewalk: Fix race between unmap and page walker 2022-09-03 10:13:13 -07:00
readahead.c mm: add PSI accounting around ->read_folio and ->readahead calls 2022-09-20 08:24:38 -06:00
rmap.c mm,thp,rmap: fix races between updates of subpages_mapcount 2022-12-11 18:12:20 -08:00
rodata_test.c mm/rodata_test: use PAGE_ALIGNED() helper 2022-10-03 14:03:05 -07:00
secretmem.c mm/secretmem: remove reduntant return value 2022-10-03 14:03:36 -07:00
shmem.c mm/shmem: restore SHMEM_HUGE_DENY precedence over MADV_COLLAPSE 2023-01-11 16:14:20 -08:00
shrinker_debug.c mm: shrinkers: fix double kfree on shrinker name 2022-07-29 18:07:13 -07:00
shuffle.c mm/shuffle: convert module_param_call to module_param_cb 2022-10-03 14:03:07 -07:00
shuffle.h
slab_common.c hardening updates for v6.2-rc1 2022-12-14 12:20:00 -08:00
slab.c Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next 2022-11-21 10:36:09 +01:00
slab.h Merge branch 'slub-tiny-v1r6' into slab/for-next 2022-12-01 00:14:00 +01:00
slob.c Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-next 2022-09-29 11:30:55 +02:00
slub.c MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
sparse-vmemmap.c mm/sparse-vmemmap: generalise vmemmap_populate_hugepages() 2022-12-11 18:12:12 -08:00
sparse.c mm/hwpoison: introduce per-memory_block hwpoison counter 2022-11-08 17:37:22 -08:00
swap_cgroup.c mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled 2022-10-03 14:03:36 -07:00
swap_slots.c mm/swap: convert put_swap_page() to put_swap_folio() 2022-10-03 14:02:46 -07:00
swap_state.c mm: mmu_gather: prepare to gather encoded page pointers with flags 2022-11-30 15:58:50 -08:00
swap.c mm: teach release_pages() to take an array of encoded page pointers too 2022-11-30 15:58:50 -08:00
swap.h mm: convert find_get_incore_page() to filemap_get_incore_folio() 2022-11-08 17:37:18 -08:00
swapfile.c MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
truncate.c folio-compat: remove lru_cache_add() 2022-12-11 18:12:13 -08:00
usercopy.c mm: use kstrtobool() instead of strtobool() 2022-11-30 15:58:45 -08:00
userfaultfd.c New Feature: 2022-12-17 14:06:53 -06:00
util.c mm,thp,rmap: simplify compound page mapcount handling 2022-11-30 15:58:46 -08:00
vmalloc.c mm: vmalloc: use trace_free_vmap_area_noflush event 2022-11-08 17:37:17 -08:00
vmpressure.c
vmscan.c mm: multi-gen LRU: fix crash during cgroup migration 2023-01-31 16:44:08 -08:00
vmstat.c mm: vmscan: split khugepaged stats from direct reclaim stats 2022-11-30 15:58:41 -08:00
workingset.c folio-compat: remove lru_cache_add() 2022-12-11 18:12:13 -08:00
z3fold.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zbud.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zpool.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zsmalloc.c zsmalloc: fix a race with deferred_handles storing 2023-01-31 16:44:07 -08:00
zswap.c zswap: fix writeback lock ordering for zsmalloc 2022-12-11 18:12:09 -08:00