mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-12-03 00:54:09 +08:00
79c28a4167
Patch series "Migrate Pages in lieu of discard", v11.

We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end up in
the slower persistent memory.  Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.

This patchset implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.

While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never brings
them back to fast DRAM.  Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sitting in the DRAM node.  So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.

We have tested the patchset with PostgreSQL and pgbench.  On a 2-socket
server machine with DRAM and PMEM, the kernel with the patchset improves
the pgbench score by up to 22.1% compared with the DRAM only + disk case.
This comes from the reduced disk read throughput (which drops by up to
70.8%).

== Open Issues ==

 * Memory policies and cpusets that, for instance, restrict allocations
   to DRAM can be demoted to PMEM whenever they opt in to this
   new mechanism.  A cgroup-level API to opt-in or opt-out of
   these migrations will likely be required as a follow-on.
 * Could be more aggressive about where anon LRU scanning occurs
   since it no longer necessarily involves I/O.  get_scan_count()
   for instance says: "If we have no swap space, do not bother
   scanning anon pages"

This patch (of 9):

Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table.  This allows creating a single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages.  A node with no target is a "terminal
node", so reclaim acts normally there.  The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.

When memory fills up on a node, memory contents can be automatically
migrated to another node.  The biggest problems are knowing when to
migrate and to where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists.  Those lists already tell us
where to look next if memory is full.  It would also be logical to move
memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists.  This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate
node migration ordering.  But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

RCU is used to allow lock-less reads of node_demotion[] and to prevent
demotion cycles from being observed.  If multiple reads of node_demotion[]
are performed, a single rcu_read_lock() must be held over all reads to
ensure no cycles are observed.  Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

	terminal_node = start_node;
	while (node_demotion[terminal_node] != NUMA_NO_NODE) {
		terminal_node = node_demotion[terminal_node];
	}

The initial values are:

	node_demotion[0] = 1;
	node_demotion[1] = NUMA_NO_NODE;

and are updated to:

	node_demotion[0] = NUMA_NO_NODE;
	node_demotion[1] = 0;

What guarantees that the cycle is not observed:

	node_demotion[0] = 1;
	node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the loop.  Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.

RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[].  This ensures that any updates are *eventually*
observed.

Consider the above loop again.  The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

This code is unused for now.  It will be called later in the series.

Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
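
The terminal-node walk described in the commit message can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the `model_*` and `MODEL_*` names are hypothetical, the lock helpers are empty placeholders standing in for rcu_read_lock()/rcu_read_unlock(), and `MODEL_READ_ONCE()` merely imitates the effect of READ_ONCE() with a volatile load. The hop bound is an addition for illustration so that a cyclic table is reported instead of looping forever; the commit message instead relies on RCU so that readers never observe a cycle.

```c
#include <assert.h>

#define MODEL_NR_NODES	4	/* hypothetical node count for this model */
#define MODEL_NO_NODE	(-1)	/* stands in for NUMA_NO_NODE */

/* Userspace stand-in for READ_ONCE(): force a real load on every use. */
#define MODEL_READ_ONCE(x)	(*(const volatile int *)&(x))

/* Placeholders: the kernel would use rcu_read_lock()/rcu_read_unlock(). */
static inline void model_read_lock(void) { }
static inline void model_read_unlock(void) { }

/* Initial state from the commit message: node 0 demotes to node 1. */
static int model_node_demotion[MODEL_NR_NODES] = {
	1, MODEL_NO_NODE, MODEL_NO_NODE, MODEL_NO_NODE
};

/*
 * Walk the demotion path from start_node to its terminal node.
 * All reads happen inside one read-side section, as the commit message
 * requires.  Returns MODEL_NO_NODE if more than MODEL_NR_NODES hops are
 * needed, which can only happen when the table contains a cycle.
 */
static int model_terminal_node(int start_node)
{
	int node = start_node;
	int hops;

	assert(start_node >= 0 && start_node < MODEL_NR_NODES);

	model_read_lock();
	for (hops = 0; hops < MODEL_NR_NODES; hops++) {
		int next = MODEL_READ_ONCE(model_node_demotion[node]);

		if (next == MODEL_NO_NODE) {
			model_read_unlock();
			return node;
		}
		node = next;
	}
	model_read_unlock();
	return MODEL_NO_NODE;	/* cycle observed */
}
```

With the initial table, `model_terminal_node(0)` yields node 1. If the inconsistent state from the commit message (0 -> 1 and 1 -> 0) were ever visible, the model would return `MODEL_NO_NODE` rather than spin.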
675 lines
21 KiB
C
/* SPDX-License-Identifier: GPL-2.0-or-later */
/* internal.h: mm/ internal definitions
 *
 * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 */
#ifndef __MM_INTERNAL_H
#define __MM_INTERNAL_H

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/tracepoint-defs.h>

/*
 * The set of flags that only affect watermark checking and reclaim
 * behaviour. This is used by the MM to obey the caller constraints
 * about IO, FS and watermark checking while ignoring placement
 * hints such as HIGHMEM usage.
 */
#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
			__GFP_ATOMIC)

/* The GFP flags allowed during early boot */
#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))

/* Control allocation cpuset and node placement constraints */
#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)

/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)

void page_writeback_init(void);

vm_fault_t do_swap_page(struct vm_fault *vmf);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
		unsigned long floor, unsigned long ceiling);

static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
{
	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
}

void unmap_page_range(struct mmu_gather *tlb,
			     struct vm_area_struct *vma,
			     unsigned long addr, unsigned long end,
			     struct zap_details *details);

void do_page_cache_ra(struct readahead_control *, unsigned long nr_to_read,
		unsigned long lookahead_size);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
		struct file *file, pgoff_t index, unsigned long nr_to_read)
{
	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, index);
	force_page_cache_ra(&ractl, nr_to_read);
}

unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
		pgoff_t end, struct pagevec *pvec, pgoff_t *indices);

/**
 * page_evictable - test whether a page is evictable
 * @page: the page to test
 *
 * Test whether page is evictable--i.e., should be placed on active/inactive
 * lists vs unevictable list.
 *
 * Reasons page might not be evictable:
 * (1) page's mapping marked unevictable
 * (2) page is part of an mlocked VMA
 *
 */
static inline bool page_evictable(struct page *page)
{
	bool ret;

	/* Prevent address_space of inode and swap cache from being freed */
	rcu_read_lock();
	ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
	rcu_read_unlock();
	return ret;
}

/*
 * Turn a non-refcounted page (->_refcount == 0) into refcounted with
 * a count of one.
 */
static inline void set_page_refcounted(struct page *page)
{
	VM_BUG_ON_PAGE(PageTail(page), page);
	VM_BUG_ON_PAGE(page_ref_count(page), page);
	set_page_count(page, 1);
}

extern unsigned long highest_memmap_pfn;

/*
 * Maximum number of reclaim retries without progress before the OOM
 * killer is considered the only way forward.
 */
#define MAX_RECLAIM_RETRIES 16

/*
 * in mm/vmscan.c:
 */
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);

/*
 * in mm/rmap.c:
 */
extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);

/*
 * in mm/memcontrol.c:
 */
extern bool cgroup_memory_nokmem;

/*
 * in mm/page_alloc.c
 */

/*
 * Structure for holding the mostly immutable allocation parameters passed
 * between functions involved in allocations, including the alloc_pages*
 * family of functions.
 *
 * nodemask, migratetype and highest_zoneidx are initialized only once in
 * __alloc_pages() and then never change.
 *
 * zonelist, preferred_zone and highest_zoneidx are set first in
 * __alloc_pages() for the fast path, and might be later changed
 * in __alloc_pages_slowpath(). All other functions pass the whole structure
 * by a const pointer.
 */
struct alloc_context {
	struct zonelist *zonelist;
	nodemask_t *nodemask;
	struct zoneref *preferred_zoneref;
	int migratetype;

	/*
	 * highest_zoneidx represents highest usable zone index of
	 * the allocation request. Due to the nature of the zone,
	 * memory on lower zone than the highest_zoneidx will be
	 * protected by lowmem_reserve[highest_zoneidx].
	 *
	 * highest_zoneidx is also used by reclaim/compaction to limit
	 * the target zone since higher zone than this index cannot be
	 * usable for this allocation request.
	 */
	enum zone_type highest_zoneidx;
	bool spread_dirty_pages;
};

/*
 * Locate the struct page for both the matching buddy in our
 * pair (buddy1) and the combined O(n+1) page they form (page).
 *
 * 1) Any buddy B1 will have an order O twin B2 which satisfies
 * the following equation:
 *     B2 = B1 ^ (1 << O)
 * For example, if the starting buddy (buddy2) is #8 its order
 * 1 buddy is #10:
 *     B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
 *
 * 2) Any buddy B will have an order O+1 parent P which
 * satisfies the following equation:
 *     P = B & ~(1 << O)
 *
 * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
 */
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	return page_pfn ^ (1 << order);
}

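The two equations in the comment above can be checked with a small userspace sketch. The `sketch_*` names are hypothetical and not part of the header. Unlike __find_buddy_pfn(), the sketch shifts `1UL` so that the arithmetic stays within unsigned long for any order below the word size; that is a choice made for the illustration only.

```c
#include <assert.h>
#include <limits.h>

/* Equation 1: XOR with the order bit flips between the two buddies. */
static unsigned long sketch_buddy_pfn(unsigned long pfn, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return pfn ^ (1UL << order);
}

/* Equation 2: clearing the order bit gives the start of the order+1 block. */
static unsigned long sketch_parent_pfn(unsigned long pfn, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return pfn & ~(1UL << order);
}
```

For the example in the comment, pfn 8 at order 1 pairs with pfn 10, and both buddies share the order-2 parent that starts at pfn 8.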
extern struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
				unsigned long end_pfn, struct zone *zone);

static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
				unsigned long end_pfn, struct zone *zone)
{
	if (zone->contiguous)
		return pfn_to_page(start_pfn);

	return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
}

extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __putback_isolated_page(struct page *page, unsigned int order,
				    int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
					unsigned int order);
extern void __free_pages_core(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned int order);
extern void post_alloc_hook(struct page *page, unsigned int order,
					gfp_t gfp_flags);
extern int user_min_free_kbytes;

extern void free_unref_page(struct page *page, unsigned int order);
extern void free_unref_page_list(struct list_head *list);

extern void zone_pcp_update(struct zone *zone, int cpu_online);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
extern void zone_pcp_enable(struct zone *zone);

extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
			  phys_addr_t min_addr,
			  int nid, bool exact_nid);

#if defined CONFIG_COMPACTION || defined CONFIG_CMA

/*
 * in mm/compaction.c
 */
/*
 * compact_control is used to track pages being migrated and the free pages
 * they are being migrated to during memory compaction. The free_pfn starts
 * at the end of a zone and migrate_pfn begins at the start. Movable pages
 * are moved to the end of a zone during a compaction run and the run
 * completes when free_pfn <= migrate_pfn
 */
struct compact_control {
	struct list_head freepages;	/* List of free pages to migrate to */
	struct list_head migratepages;	/* List of pages being migrated */
	unsigned int nr_freepages;	/* Number of isolated free pages */
	unsigned int nr_migratepages;	/* Number of pages to migrate */
	unsigned long free_pfn;		/* isolate_freepages search base */
	/*
	 * Acts as an in/out parameter to page isolation for migration.
	 * isolate_migratepages uses it as a search base.
	 * isolate_migratepages_block will update the value to the next pfn
	 * after the last isolated one.
	 */
	unsigned long migrate_pfn;
	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
	struct zone *zone;
	unsigned long total_migrate_scanned;
	unsigned long total_free_scanned;
	unsigned short fast_search_fail;/* failures to use free list searches */
	short search_order;		/* order to start a fast search at */
	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
	int order;			/* order a direct compactor needs */
	int migratetype;		/* migratetype of direct compactor */
	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
	const int highest_zoneidx;	/* zone index of a direct compactor */
	enum migrate_mode mode;		/* Async or sync migration mode */
	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
	bool no_set_skip_hint;		/* Don't mark blocks for skipping */
	bool ignore_block_suitable;	/* Scan blocks considered unsuitable */
	bool direct_compaction;		/* False from kcompactd or /proc/... */
	bool proactive_compaction;	/* kcompactd proactive compaction */
	bool whole_zone;		/* Whole zone should/has been scanned */
	bool contended;			/* Signal lock or sched contention */
	bool rescan;			/* Rescanning the same pageblock */
	bool alloc_contig;		/* alloc_contig_range allocation */
};

/*
 * Used in direct compaction when a page should be taken from the freelists
 * immediately when one is created during the free path.
 */
struct capture_control {
	struct compact_control *cc;
	struct page *page;
};

unsigned long
isolate_freepages_range(struct compact_control *cc,
			unsigned long start_pfn, unsigned long end_pfn);
int
isolate_migratepages_range(struct compact_control *cc,
			   unsigned long low_pfn, unsigned long end_pfn);
#endif
int find_suitable_fallback(struct free_area *area, unsigned int order,
			int migratetype, bool only_stealable, bool *can_steal);

/*
 * This function returns the order of a free page in the buddy system. In
 * general, page_zone(page)->lock must be held by the caller to prevent the
 * page from being allocated in parallel and returning garbage as the order.
 * If a caller does not hold page_zone(page)->lock, it must guarantee that the
 * page cannot be allocated or merged in parallel. Alternatively, it must
 * handle invalid values gracefully, and use buddy_order_unsafe() below.
 */
static inline unsigned int buddy_order(struct page *page)
{
	/* PageBuddy() must be checked by the caller */
	return page_private(page);
}

/*
 * Like buddy_order(), but for callers who cannot afford to hold the zone lock.
 * PageBuddy() should be checked first by the caller to minimize race window,
 * and invalid values must be handled gracefully.
 *
 * READ_ONCE is used so that if the caller assigns the result into a local
 * variable and e.g. tests it for valid range before using, the compiler cannot
 * decide to remove the variable and inline the page_private(page) multiple
 * times, potentially observing different values in the tests and the actual
 * use of the result.
 */
#define buddy_order_unsafe(page)	READ_ONCE(page_private(page))

/*
 * These three helpers classify VMAs for virtual memory accounting.
 */

/*
 * Executable code area - executable, not writable, not stack
 */
static inline bool is_exec_mapping(vm_flags_t flags)
{
	return (flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC;
}

/*
 * Stack area - automatically grows in one direction
 *
 * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
 * do_mmap() forbids all other combinations.
 */
static inline bool is_stack_mapping(vm_flags_t flags)
{
	return (flags & VM_STACK) == VM_STACK;
}

/*
 * Data area - private, writable, not stack
 */
static inline bool is_data_mapping(vm_flags_t flags)
{
	return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

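How the three masks partition mappings is easier to see with concrete flag words. The sketch below is a userspace model, not kernel code: the `SK_*` constants and `sk_*` helpers are hypothetical names, and the numeric values are assumptions chosen for illustration (in particular, `SK_VM_STACK` assumes a stack that grows down; the authoritative definitions live in include/linux/mm.h).

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long sk_vm_flags_t;

/* Assumed flag values, for illustration only. */
#define SK_VM_READ	0x00000001UL
#define SK_VM_WRITE	0x00000002UL
#define SK_VM_EXEC	0x00000004UL
#define SK_VM_SHARED	0x00000008UL
#define SK_VM_STACK	0x00000100UL	/* assumes a grows-down stack */

/* Executable, and neither writable nor stack. */
static bool sk_is_exec_mapping(sk_vm_flags_t flags)
{
	return (flags & (SK_VM_EXEC | SK_VM_WRITE | SK_VM_STACK)) == SK_VM_EXEC;
}

/* Any mapping carrying the stack growth bit. */
static bool sk_is_stack_mapping(sk_vm_flags_t flags)
{
	return (flags & SK_VM_STACK) == SK_VM_STACK;
}

/* Writable, and neither shared nor stack. */
static bool sk_is_data_mapping(sk_vm_flags_t flags)
{
	return (flags & (SK_VM_WRITE | SK_VM_SHARED | SK_VM_STACK)) == SK_VM_WRITE;
}

/* Sanity check: no flag word may be classified as more than one kind. */
static int sk_count_classes(sk_vm_flags_t flags)
{
	int n = sk_is_exec_mapping(flags) + sk_is_stack_mapping(flags) +
		sk_is_data_mapping(flags);

	assert(n <= 1);
	return n;
}
```

Under these assumed values, a read+exec text mapping counts as exec, a private read+write mapping counts as data, and a writable stack counts only as stack because the stack bit excludes it from the data test. A shared writable mapping falls into none of the three classes.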
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
		struct vm_area_struct *prev);
void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);

#ifdef CONFIG_MMU
extern long populate_vma_page_range(struct vm_area_struct *vma,
		unsigned long start, unsigned long end, int *locked);
extern long faultin_vma_page_range(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end,
				   bool write, int *locked);
extern void munlock_vma_pages_range(struct vm_area_struct *vma,
			unsigned long start, unsigned long end);
static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
{
	munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
}

/*
 * must be called with vma's mmap_lock held for read or write, and page locked.
 */
extern void mlock_vma_page(struct page *page);
extern unsigned int munlock_vma_page(struct page *page);

extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
			      unsigned long len);

/*
 * Clear the page's PageMlocked().  This can be useful in a situation where
 * we want to unconditionally remove a page from the pagecache -- e.g.,
 * on truncation or freeing.
 *
 * It is legal to call this function for any page, mlocked or not.
 * If called for a page that is still mapped by mlocked vmas, all we do
 * is revert to lazy LRU behaviour -- semantics are not broken.
 */
extern void clear_page_mlock(struct page *page);

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

/*
 * At what user virtual address is page expected in vma?
 * Returns -EFAULT if all of the page is outside the range of vma.
 * If page is a compound head, the entire compound page is considered.
 */
static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff;
	unsigned long address;

	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
	pgoff = page_to_pgoff(page);
	if (pgoff >= vma->vm_pgoff) {
		address = vma->vm_start +
			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		/* Check for address beyond vma (or wrapped through 0?) */
		if (address < vma->vm_start || address >= vma->vm_end)
			address = -EFAULT;
	} else if (PageHead(page) &&
		   pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
		/* Test above avoids possibility of wrap to 0 on 32-bit */
		address = vma->vm_start;
	} else {
		address = -EFAULT;
	}
	return address;
}

/*
 * Then at what user virtual address will none of the page be found in vma?
 * Assumes that vma_address() already returned a good starting address.
 * If page is a compound head, the entire compound page is considered.
 */
static inline unsigned long
vma_address_end(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff;
	unsigned long address;

	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
	pgoff = page_to_pgoff(page) + compound_nr(page);
	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	/* Check for address beyond vma (or wrapped through 0?) */
	if (address < vma->vm_start || address > vma->vm_end)
		address = vma->vm_end;
	return address;
}

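The address arithmetic in vma_address() and vma_address_end() can be followed with plain integers. The sketch below is a simplified userspace model under stated assumptions: `struct sk_vma`, the `sk_*` functions and the `SK_*` constants are hypothetical; a page is described only by its page offset and its number of subpages (`nr > 1` plays the role of PageHead()); pages are assumed to be 4 KiB; and the error value assumes EFAULT is 14.

```c
#include <assert.h>

#define SK_PAGE_SHIFT	12				/* assumes 4 KiB pages */
#define SK_EFAULT_ADDR	((unsigned long)-14L)		/* models -EFAULT */

struct sk_vma {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_end;	/* first address past the mapping */
	unsigned long vm_pgoff;	/* offset of vm_start, in pages */
};

/* First address in vma at which pages [pgoff, pgoff + nr) are mapped. */
static unsigned long sk_vma_address(unsigned long pgoff, unsigned long nr,
				    const struct sk_vma *vma)
{
	unsigned long address;

	assert(nr >= 1);
	if (pgoff >= vma->vm_pgoff) {
		address = vma->vm_start +
			((pgoff - vma->vm_pgoff) << SK_PAGE_SHIFT);
		/* Beyond the vma, or wrapped through 0 */
		if (address < vma->vm_start || address >= vma->vm_end)
			address = SK_EFAULT_ADDR;
	} else if (nr > 1 && pgoff + nr - 1 >= vma->vm_pgoff) {
		/* Compound page starts before the vma but reaches into it */
		address = vma->vm_start;
	} else {
		address = SK_EFAULT_ADDR;
	}
	return address;
}

/* First address in vma past the last subpage, clamped to vm_end. */
static unsigned long sk_vma_address_end(unsigned long pgoff, unsigned long nr,
					const struct sk_vma *vma)
{
	unsigned long end_pgoff = pgoff + nr;
	unsigned long address = vma->vm_start +
		((end_pgoff - vma->vm_pgoff) << SK_PAGE_SHIFT);

	if (address < vma->vm_start || address > vma->vm_end)
		address = vma->vm_end;
	return address;
}
```

For a four-page mapping at 0x10000 with `vm_pgoff` 10, a single page at offset 11 maps to 0x11000 and ends at 0x12000. A four-page compound page starting at offset 8 is reported at `vm_start`, because only its tail overlaps the mapping.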
static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
						    struct file *fpin)
{
	int flags = vmf->flags;

	if (fpin)
		return fpin;

	/*
	 * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or
	 * anything, so we only pin the file and drop the mmap_lock if only
	 * FAULT_FLAG_ALLOW_RETRY is set, while this is the first attempt.
	 */
	if (fault_flag_allow_retry_first(flags) &&
	    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
		fpin = get_file(vmf->vma->vm_file);
		mmap_read_unlock(vmf->vma->vm_mm);
	}
	return fpin;
}

#else /* !CONFIG_MMU */
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
{
}
#endif /* !CONFIG_MMU */

/*
 * Return the mem_map entry representing the 'offset' subpage within
 * the maximally aligned gigantic page 'base'.  Handle any discontiguity
 * in the mem_map at MAX_ORDER_NR_PAGES boundaries.
 */
static inline struct page *mem_map_offset(struct page *base, int offset)
{
	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
		return nth_page(base, offset);
	return base + offset;
}

/*
 * Iterator over all subpages within the maximally aligned gigantic
 * page 'base'.  Handle any discontiguity in the mem_map.
 */
static inline struct page *mem_map_next(struct page *iter,
						struct page *base, int offset)
{
	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
		unsigned long pfn = page_to_pfn(base) + offset;
		if (!pfn_valid(pfn))
			return NULL;
		return pfn_to_page(pfn);
	}
	return iter + 1;
}

/* Memory initialisation debug and verification */
enum mminit_level {
	MMINIT_WARNING,
	MMINIT_VERIFY,
	MMINIT_TRACE
};

#ifdef CONFIG_DEBUG_MEMORY_INIT

extern int mminit_loglevel;

#define mminit_dprintk(level, prefix, fmt, arg...) \
do { \
	if (level < mminit_loglevel) { \
		if (level <= MMINIT_WARNING) \
			pr_warn("mminit::" prefix " " fmt, ##arg); \
		else \
			printk(KERN_DEBUG "mminit::" prefix " " fmt, ##arg); \
	} \
} while (0)

extern void mminit_verify_pageflags_layout(void);
extern void mminit_verify_zonelist(void);
#else

static inline void mminit_dprintk(enum mminit_level level,
				const char *prefix, const char *fmt, ...)
{
}

static inline void mminit_verify_pageflags_layout(void)
{
}

static inline void mminit_verify_zonelist(void)
{
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
				unsigned long *end_pfn);
#else
static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
				unsigned long *end_pfn)
{
}
#endif /* CONFIG_SPARSEMEM */

#define NODE_RECLAIM_NOSCAN	-2
#define NODE_RECLAIM_FULL	-1
#define NODE_RECLAIM_SOME	0
#define NODE_RECLAIM_SUCCESS	1

#ifdef CONFIG_NUMA
extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
extern int find_next_best_node(int node, nodemask_t *used_node_mask);
#else
static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
				unsigned int order)
{
	return NODE_RECLAIM_NOSCAN;
}
static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
{
	return NUMA_NO_NODE;
}
#endif

extern int hwpoison_filter(struct page *p);

extern u32 hwpoison_filter_dev_major;
extern u32 hwpoison_filter_dev_minor;
extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
extern u32 hwpoison_filter_enable;

extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
        unsigned long, unsigned long,
        unsigned long, unsigned long);

extern void set_pageblock_order(void);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
					    struct list_head *page_list);
/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */

/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)

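The comment says the low ALLOC_WMARK bits double as an array index, with ALLOC_WMARK_MASK stripping the other allocation flags. A minimal userspace sketch of that idea follows. The `SK_*` names and `sk_pick_watermark()` are hypothetical, and the enum is an assumption modelled on the MIN/LOW/HIGH ordering implied by the defines above; the real enum lives in include/linux/mmzone.h.

```c
#include <assert.h>

/* Assumed ordering of the watermark indices, for illustration only. */
enum sk_zone_watermarks {
	SK_WMARK_MIN,
	SK_WMARK_LOW,
	SK_WMARK_HIGH,
	SK_NR_WMARK
};

#define SK_ALLOC_WMARK_MIN	SK_WMARK_MIN
#define SK_ALLOC_WMARK_LOW	SK_WMARK_LOW
#define SK_ALLOC_WMARK_HIGH	SK_WMARK_HIGH
#define SK_ALLOC_NO_WATERMARKS	0x04
#define SK_ALLOC_WMARK_MASK	(SK_ALLOC_NO_WATERMARKS - 1)	/* 0x03 */

#define SK_ALLOC_HARDER		0x10
#define SK_ALLOC_CPUSET		0x40

/*
 * Select a watermark from a per-zone table: mask off every flag above
 * the two index bits, then use what remains as the array index.
 */
static unsigned long sk_pick_watermark(const unsigned long *wmark,
				       unsigned int alloc_flags)
{
	unsigned int idx = alloc_flags & SK_ALLOC_WMARK_MASK;

	assert(idx < SK_NR_WMARK);
	return wmark[idx];
}
```

Because SK_ALLOC_NO_WATERMARKS is the first bit above the index range, subtracting one from it gives exactly the mask of index bits, so flags such as SK_ALLOC_CPUSET can be OR-ed in without disturbing the lookup.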
/*
 * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
 * cannot assume a reduced access to memory reserves is sufficient for
 * !MMU
 */
#ifdef CONFIG_MMU
#define ALLOC_OOM		0x08
#else
#define ALLOC_OOM		ALLOC_NO_WATERMARKS
#endif

#define ALLOC_HARDER		 0x10 /* try to alloc harder */
#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
#define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
#ifdef CONFIG_ZONE_DMA32
#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
#else
#define ALLOC_NOFRAGMENT	  0x0
#endif
#define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */

enum ttu_flags;
struct tlbflush_unmap_batch;


/*
 * only for MM internal work items which do not depend on
 * any allocations or locks which might depend on allocations
 */
extern struct workqueue_struct *mm_percpu_wq;

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
#else
static inline void try_to_unmap_flush(void)
{
}
static inline void try_to_unmap_flush_dirty(void)
{
}
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */

extern const struct trace_print_flags pageflag_names[];
extern const struct trace_print_flags vmaflag_names[];
extern const struct trace_print_flags gfpflag_names[];

static inline bool is_migrate_highatomic(enum migratetype migratetype)
{
	return migratetype == MIGRATE_HIGHATOMIC;
}

static inline bool is_migrate_highatomic_page(struct page *page)
{
	return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
}

void setup_zone_pageset(struct zone *zone);

struct migration_target_control {
	int nid;		/* preferred node id */
	nodemask_t *nmask;
	gfp_t gfp_mask;
};

/*
 * mm/vmalloc.c
 */
#ifdef CONFIG_MMU
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                pgprot_t prot, struct page **pages, unsigned int page_shift);
#else
static inline
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                pgprot_t prot, struct page **pages, unsigned int page_shift)
{
	return -EINVAL;
}
#endif

void vunmap_range_noflush(unsigned long start, unsigned long end);

int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
		      unsigned long addr, int page_nid, int *flags);

#endif	/* __MM_INTERNAL_H */