mirror of
https://github.com/edk2-porting/linux-next.git
synced 2025-01-04 19:54:03 +08:00
444eb2a449
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure.  The problem is that THP
allocation requests potentially enter reclaim/compaction.  This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses.  While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so.  Specifically, a simple anon/file
streaming workload will enter direct reclaim, on NUMA at least, even
though the working set size is 80% of RAM.  It's been years and it's time
to throw in the towel.

First, this patch defines THP defrag as follows:

 madvise: A failed allocation will direct reclaim/compact if the
          application requests it
 never:   Neither reclaim/compact nor wake kswapd
 defer:   A failed allocation will wake kswapd/kcompactd
 always:  A failed allocation will direct reclaim/compact (historical
          behaviour)

khugepaged defrag will enter direct reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested it.

Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work.  The callers that
really care are slub/slab and they are updated accordingly.  The slab one
may be surprising because it also corrects a comment as kswapd was never
woken up by that path.

This means that, by default, a THP fault will no longer stall for most
applications, which is the ideal for most users: they get THP if huge
pages are immediately available.  There are still options for users that
prefer a stall at startup of a new application by either restoring
historical behaviour with "always" or picking a half-way point with
"defer" where kswapd does some of the work in the background and wakes
kcompactd if necessary.
THP defrag for khugepaged remains enabled and will enter direct reclaim
but will not wake kswapd or kcompactd.

After this patch a THP allocation failure will quickly fall back and rely
on khugepaged to recover the situation at some time in the future.  In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win whereas a stall to reclaim/compaction is
definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write a
large anonymous mapping (to avoid the zero page) multiple times.  The
total size of the mappings is 80% of RAM and the benchmark simply measures
how long it takes to complete.  It uses multiple threads to see if that is
a factor.  On UMA, the performance is almost identical so is not reported
but on NUMA, we see this

usemem
                               4.4.0                4.4.0
                      kcompactd-v1r1        nodefrag-v1r3
Amean    System-1     102.86 (  0.00%)     46.81 ( 54.50%)
Amean    System-4      37.85 (  0.00%)     34.02 ( 10.12%)
Amean    System-7      48.12 (  0.00%)     46.89 (  2.56%)
Amean    System-12     51.98 (  0.00%)     56.96 ( -9.57%)
Amean    System-21     80.16 (  0.00%)     79.05 (  1.39%)
Amean    System-30    110.71 (  0.00%)    107.17 (  3.20%)
Amean    System-48    127.98 (  0.00%)    124.83 (  2.46%)
Amean    Elapsd-1     185.84 (  0.00%)    105.51 ( 43.23%)
Amean    Elapsd-4      26.19 (  0.00%)     25.58 (  2.33%)
Amean    Elapsd-7      21.65 (  0.00%)     21.62 (  0.16%)
Amean    Elapsd-12     18.58 (  0.00%)     17.94 (  3.43%)
Amean    Elapsd-21     17.53 (  0.00%)     16.60 (  5.33%)
Amean    Elapsd-30     17.45 (  0.00%)     17.13 (  1.84%)
Amean    Elapsd-48     15.40 (  0.00%)     15.27 (  0.82%)

For a single thread, the benchmark completes 43.23% faster with this patch
applied with smaller benefits as the thread count increases.  Similarly,
notice the large reduction in system CPU usage in most cases.  The overall
CPU time is

                   4.4.0          4.4.0
          kcompactd-v1r1  nodefrag-v1r3
User            10357.65       10438.33
System           3988.88        3543.94
Elapsed          2203.01        1634.41

Which is substantial.
Now, the reclaim figures

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                   128458477      278352931
Major Faults                     2174976            225
Swap Ins                        16904701              0
Swap Outs                       17359627              0
Allocation stalls                  43611              0
DMA allocs                             0              0
DMA32 allocs                    19832646       19448017
Normal allocs                  614488453      580941839
Movable allocs                         0              0
Direct pages scanned            24163800              0
Kswapd pages scanned                   0              0
Kswapd pages reclaimed                 0              0
Direct pages reclaimed          20691346              0
Compaction stalls                  42263              0
Compaction success                   938              0
Compaction failures                41325              0

This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the fault latency depending on
whether base or huge pages were used

thpscale Fault Latencies
                                    4.4.0                  4.4.0
                           kcompactd-v1r1          nodefrag-v1r3
Amean    fault-base-1     5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3     6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5     6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7     7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12   13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18   18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24   21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30   26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32   26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1     4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3     2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5     2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7     3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12    3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18    6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24    8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30   13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32    9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority
of cases but with the obvious caveat that fewer THPs are actually used in
this adverse workload

                                 4.4.0                 4.4.0
                        kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1      0.71 (  0.00%)      14.04 (1865.22%)
Percentage huge-3     10.77 (  0.00%)      33.05 (206.85%)
Percentage huge-5     60.39 (  0.00%)      38.51 (-36.23%)
Percentage huge-7     45.97 (  0.00%)      34.57 (-24.79%)
Percentage huge-12    68.12 (  0.00%)      40.07 (-41.17%)
Percentage huge-18    64.93 (  0.00%)      47.82 (-26.35%)
Percentage huge-24    62.69 (  0.00%)      44.23 (-29.44%)
Percentage huge-30    43.49 (  0.00%)      55.38 ( 27.34%)
Percentage huge-32    50.72 (  0.00%)      51.90 (  2.35%)

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                    37429143       47564000
Major Faults                        1916           1558
Swap Ins                            1466           1079
Swap Outs                        2936863         149626
Allocation stalls                  62510              3
DMA allocs                             0              0
DMA32 allocs                     6566458        6401314
Normal allocs                  216361697      216538171
Movable allocs                         0              0
Direct pages scanned            25977580          17998
Kswapd pages scanned                   0        3638931
Kswapd pages reclaimed                 0         207236
Direct pages reclaimed           8833714             88
Compaction stalls                 103349              5
Compaction success                   270              4
Compaction failures               103079              1

Note again that while this does swap as it's an aggressive workload, the
direct reclaim activity and allocation stalls are substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark.
For this, I do not have figures for NUMA but it's something that does
impact UMA so I'll report what is available

stutter
                                 4.4.0                  4.4.0
                        kcompactd-v1r1          nodefrag-v1r3
Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently.  This shows a mix because the ideal case of mapping with THP
is not hit as often.  However, note that 99% of the mappings complete
13.79% faster.  The CPU usage here is particularly interesting

                   4.4.0          4.4.0
          kcompactd-v1r1  nodefrag-v1r3
User               67.50           0.99
System           1327.88          91.30
Elapsed          2079.00        2128.98

And once again we look at the reclaim figures

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                   335241922     1314582827
Major Faults                         715            819
Swap Ins                               0              0
Swap Outs                              0              0
Allocation stalls                 532723              0
DMA allocs                             0              0
DMA32 allocs                  1822364341     1177950222
Normal allocs                 1815640808     1517844854
Movable allocs                         0              0
Direct pages scanned            21892772              0
Kswapd pages scanned            20015890       41879484
Kswapd pages reclaimed          19961986       41822072
Direct pages reclaimed          21892741              0
Compaction stalls                1065755              0
Compaction success                   514              0
Compaction failures              1065241              0

Allocation stalls and all direct reclaim activity are eliminated as well
as compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available.  We're not going to reach the point where they are completely
free so let's take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
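
The four defrag settings described in the commit message can be read as a choice of which reclaim bits a THP fault adds to its allocation mask. What follows is a minimal user-space sketch of that reading, not the kernel's implementation: the enum, the function name and the SKETCH_ macros are hypothetical, and only the two bit values are copied from the ___GFP_DIRECT_RECLAIM and ___GFP_KSWAPD_RECLAIM definitions in the header shown on this page.

```c
#include <stdbool.h>

/* Bit values copied from ___GFP_DIRECT_RECLAIM / ___GFP_KSWAPD_RECLAIM. */
#define SKETCH_GFP_DIRECT_RECLAIM	0x400000u
#define SKETCH_GFP_KSWAPD_RECLAIM	0x2000000u

/* Hypothetical names for the four defrag settings in the commit message. */
enum thp_defrag_mode {
	THP_DEFRAG_ALWAYS,
	THP_DEFRAG_DEFER,
	THP_DEFRAG_MADVISE,
	THP_DEFRAG_NEVER,
};

/*
 * Return the reclaim bits a THP fault would add on top of a base mask
 * that has both reclaim bits cleared (GFP_TRANSHUGE masks out
 * __GFP_RECLAIM in this version of the header).
 */
static unsigned int thp_fault_reclaim_bits(enum thp_defrag_mode mode,
					   bool vma_madvised)
{
	switch (mode) {
	case THP_DEFRAG_ALWAYS:
		/* historical behaviour: stall in direct reclaim/compaction */
		return SKETCH_GFP_DIRECT_RECLAIM;
	case THP_DEFRAG_DEFER:
		/* no stall: only wake the background threads */
		return SKETCH_GFP_KSWAPD_RECLAIM;
	case THP_DEFRAG_MADVISE:
		/* stall only if the application asked for huge pages */
		return vma_madvised ? SKETCH_GFP_DIRECT_RECLAIM : 0;
	case THP_DEFRAG_NEVER:
	default:
		return 0;
	}
}
```

With the new default of "madvise", a region that was never madvised gets no reclaim bits at all, so a failed THP allocation falls back to base pages without stalling.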
559 lines
21 KiB
C
#ifndef __LINUX_GFP_H
#define __LINUX_GFP_H

#include <linux/mmdebug.h>
#include <linux/mmzone.h>
#include <linux/stddef.h>
#include <linux/linkage.h>
#include <linux/topology.h>

struct vm_area_struct;

/*
 * In case of changes, please don't forget to update
 * include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
 */

/* Plain integer GFP bitmasks. Do not use this directly. */
#define ___GFP_DMA		0x01u
#define ___GFP_HIGHMEM		0x02u
#define ___GFP_DMA32		0x04u
#define ___GFP_MOVABLE		0x08u
#define ___GFP_RECLAIMABLE	0x10u
#define ___GFP_HIGH		0x20u
#define ___GFP_IO		0x40u
#define ___GFP_FS		0x80u
#define ___GFP_COLD		0x100u
#define ___GFP_NOWARN		0x200u
#define ___GFP_REPEAT		0x400u
#define ___GFP_NOFAIL		0x800u
#define ___GFP_NORETRY		0x1000u
#define ___GFP_MEMALLOC		0x2000u
#define ___GFP_COMP		0x4000u
#define ___GFP_ZERO		0x8000u
#define ___GFP_NOMEMALLOC	0x10000u
#define ___GFP_HARDWALL		0x20000u
#define ___GFP_THISNODE		0x40000u
#define ___GFP_ATOMIC		0x80000u
#define ___GFP_ACCOUNT		0x100000u
#define ___GFP_NOTRACK		0x200000u
#define ___GFP_DIRECT_RECLAIM	0x400000u
#define ___GFP_OTHER_NODE	0x800000u
#define ___GFP_WRITE		0x1000000u
#define ___GFP_KSWAPD_RECLAIM	0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
 * Physical address zone modifiers (see linux/mmzone.h - low four bits)
 *
 * Do not put any conditional on these. If necessary modify the definitions
 * without the underscores and use them consistently. The definitions here may
 * be used in bit comparisons.
 */
#define __GFP_DMA	((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM	((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32	((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
#define GFP_ZONEMASK	(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

/*
 * Page mobility and placement hints
 *
 * These flags provide hints about how mobile the page is. Pages with similar
 * mobility are placed within the same pageblocks to minimise problems due
 * to external fragmentation.
 *
 * __GFP_MOVABLE (also a zone modifier) indicates that the page can be
 *   moved by page migration during memory compaction or can be reclaimed.
 *
 * __GFP_RECLAIMABLE is used for slab allocations that specify
 *   SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
 *
 * __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
 *   these pages will be spread between local zones to avoid all the dirty
 *   pages being in one zone (fair zone allocation policy).
 *
 * __GFP_HARDWALL enforces the cpuset memory allocation policy.
 *
 * __GFP_THISNODE forces the allocation to be satisfied from the requested
 *   node with no fallbacks or placement policy enforcements.
 *
 * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg (only relevant
 *   to kmem allocations).
 */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL	((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)

/*
 * Watermark modifiers -- controls access to emergency reserves
 *
 * __GFP_HIGH indicates that the caller is high-priority and that granting
 *   the request is necessary before the system can make forward progress.
 *   For example, creating an IO context to clean pages.
 *
 * __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
 *   high priority. Users are typically interrupt handlers. This may be
 *   used in conjunction with __GFP_HIGH
 *
 * __GFP_MEMALLOC allows access to all memory. This should only be used when
 *   the caller guarantees the allocation will allow more memory to be freed
 *   very shortly e.g. process exiting or swapping. Users either should
 *   be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
 *
 * __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
 *   This takes precedence over the __GFP_MEMALLOC flag if both are set.
 */
#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
#define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)

/*
 * Reclaim modifiers
 *
 * __GFP_IO can start physical IO.
 *
 * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 *   allocator recursing into the filesystem which might already be holding
 *   locks.
 *
 * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
 *   This flag can be cleared to avoid unnecessary delays when a fallback
 *   option is available.
 *
 * __GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when
 *   the low watermark is reached and have it reclaim pages until the high
 *   watermark is reached. A caller may wish to clear this flag when fallback
 *   options are available and the reclaim is likely to disrupt the system. The
 *   canonical example is THP allocation where a fallback is cheap but
 *   reclaim/compaction may cause indirect stalls.
 *
 * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
 *
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 *   _might_ fail.  This depends upon the particular VM implementation.
 *
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 *   cannot handle allocation failures. New users should be evaluated carefully
 *   (and the flag should be used only when there is no reasonable failure
 *   policy) but it is definitely preferable to use the flag rather than
 *   opencode endless loop around allocator.
 *
 * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
 *   return NULL when direct reclaim and memory compaction have failed to allow
 *   the allocation to succeed.  The OOM killer is not called with the current
 *   implementation.
 */
#define __GFP_IO	((__force gfp_t)___GFP_IO)
#define __GFP_FS	((__force gfp_t)___GFP_FS)
#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
#define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)

/*
 * Action modifiers
 *
 * __GFP_COLD indicates that the caller does not expect the page to be used
 *   in the near future. Where possible, a cache-cold page will be returned.
 *
 * __GFP_NOWARN suppresses allocation failure reports.
 *
 * __GFP_COMP addresses compound page metadata.
 *
 * __GFP_ZERO returns a zeroed page on success.
 *
 * __GFP_NOTRACK avoids tracking with kmemcheck.
 *
 * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of
 *   distinguishing in the source between false positives and allocations that
 *   cannot be supported (e.g. page tables).
 *
 * __GFP_OTHER_NODE is for allocations that are on a remote node but that
 *   should not be accounted for as a remote allocation in vmstat. A
 *   typical user would be khugepaged collapsing a huge page on a remote
 *   node.
 */
#define __GFP_COLD	((__force gfp_t)___GFP_COLD)
#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP	((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
#define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)

/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT 26
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/*
 * Useful GFP flag combinations that are commonly used. It is recommended
 * that subsystems start with one of these combinations and then set/clear
 * __GFP_FOO flags as necessary.
 *
 * GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
 *   watermark is applied to allow access to "atomic reserves"
 *
 * GFP_KERNEL is typical for kernel-internal allocations. The caller requires
 *   ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
 *
 * GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation is
 *   accounted to kmemcg.
 *
 * GFP_NOWAIT is for kernel allocations that should not stall for direct
 *   reclaim, start physical IO or use any filesystem callback.
 *
 * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
 *   that do not require the starting of any physical IO.
 *
 * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
 *
 * GFP_USER is for userspace allocations that also need to be directly
 *   accessible by the kernel or hardware. It is typically used by hardware
 *   for buffers that are mapped to userspace (e.g. graphics) that hardware
 *   still must DMA to. cpuset limits are enforced for these allocations.
 *
 * GFP_DMA exists for historical reasons and should be avoided where possible.
 *   The flag indicates that the caller requires that the lowest zone be
 *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
 *   it would require careful auditing as some users really require it and
 *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
 *   lowest zone as a type of emergency reserve.
 *
 * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
 *   address.
 *
 * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
 *   do not need to be directly accessible by the kernel but that cannot
 *   move once in use. An example may be a hardware allocation that maps
 *   data directly into userspace but has no addressing limitations.
 *
 * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
 *   need direct access to but can use kmap() when access is required. They
 *   are expected to be movable via page reclaim or page migration. Typically,
 *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
 *
 * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
 *   that will fail quickly if memory is not available and will not wake
 *   kswapd on failure.
 */
#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO	(__GFP_RECLAIM)
#define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
			 __GFP_RECLAIMABLE)
#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA		__GFP_DMA
#define GFP_DMA32	__GFP_DMA32
#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
			 ~__GFP_RECLAIM)

/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT 3

static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);

	if (unlikely(page_group_by_mobility_disabled))
		return MIGRATE_UNMOVABLE;

	/* Group based on mobility */
	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}
#undef GFP_MOVABLE_MASK
#undef GFP_MOVABLE_SHIFT

static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}

#ifdef CONFIG_HIGHMEM
#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
#else
#define OPT_ZONE_HIGHMEM ZONE_NORMAL
#endif

#ifdef CONFIG_ZONE_DMA
#define OPT_ZONE_DMA ZONE_DMA
#else
#define OPT_ZONE_DMA ZONE_NORMAL
#endif

#ifdef CONFIG_ZONE_DMA32
#define OPT_ZONE_DMA32 ZONE_DMA32
#else
#define OPT_ZONE_DMA32 ZONE_NORMAL
#endif

/*
 * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
 * zone to use given the lowest 4 bits of gfp_t. Entries are ZONES_SHIFT long
 * and there are 16 of them to cover all possible combinations of
 * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM.
 *
 * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
 * But GFP_MOVABLE is not only a zone specifier but also an allocation
 * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
 * Only 1 bit of the lowest 3 bits (DMA,DMA32,HIGHMEM) can be set to "1".
 *
 *       bit       result
 *       =================
 *       0x0    => NORMAL
 *       0x1    => DMA or NORMAL
 *       0x2    => HIGHMEM or NORMAL
 *       0x3    => BAD (DMA+HIGHMEM)
 *       0x4    => DMA32 or DMA or NORMAL
 *       0x5    => BAD (DMA+DMA32)
 *       0x6    => BAD (HIGHMEM+DMA32)
 *       0x7    => BAD (HIGHMEM+DMA32+DMA)
 *       0x8    => NORMAL (MOVABLE+0)
 *       0x9    => DMA or NORMAL (MOVABLE+DMA)
 *       0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
 *       0xb    => BAD (MOVABLE+HIGHMEM+DMA)
 *       0xc    => DMA32 (MOVABLE+DMA32)
 *       0xd    => BAD (MOVABLE+DMA32+DMA)
 *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
 *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
 *
 * ZONES_SHIFT must be <= 2 on 32 bit platforms.
 */

#if 16 * ZONES_SHIFT > BITS_PER_LONG
#error ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
#endif

#define GFP_ZONE_TABLE ( \
	(ZONE_NORMAL << 0 * ZONES_SHIFT) \
	| (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT) \
	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT) \
	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT) \
	| (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT) \
	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT) \
	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT) \
	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT) \
)

/*
 * GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32
 * __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted. One flag per
 * entry starting with bit 0. Bit is set if the combination is not
 * allowed.
 */
#define GFP_ZONE_BAD ( \
	1 << (___GFP_DMA | ___GFP_HIGHMEM) \
	| 1 << (___GFP_DMA | ___GFP_DMA32) \
	| 1 << (___GFP_DMA32 | ___GFP_HIGHMEM) \
	| 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM) \
	| 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA) \
	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA) \
	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM) \
	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM) \
)

static inline enum zone_type gfp_zone(gfp_t flags)
{
	enum zone_type z;
	int bit = (__force int) (flags & GFP_ZONEMASK);

	z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
					 ((1 << ZONES_SHIFT) - 1);
	VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
	return z;
}

/*
 * There is only one page-allocator function, and two main namespaces to
 * it. The alloc_page*() variants return 'struct page *' and as such
 * can allocate highmem pages, the *get*page*() variants return
 * virtual kernel addresses to the allocated page(s).
 */

static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
	if (unlikely(flags & __GFP_THISNODE))
		return ZONELIST_NOFALLBACK;
#endif
	return ZONELIST_FALLBACK;
}

/*
 * We get the zone list from the current node and the gfp_mask.
 * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
 * There are two zonelists per node, one for all zones with memory and
 * one containing just zones from the node the zonelist belongs to.
 *
 * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
 * optimized to &contig_page_data at compile-time.
 */
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

#ifndef HAVE_ARCH_FREE_PAGE
static inline void arch_free_page(struct page *page, int order) { }
#endif
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
		       struct zonelist *zonelist, nodemask_t *nodemask);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist)
{
	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}

/*
 * Allocate pages, preferring the node given as nid. The node must be valid and
 * online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON(!node_online(nid));

	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

/*
 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
 * prefer the current CPU's closest node. Otherwise node must be valid and
 * online.
 */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
}

#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
	return alloc_pages_current(gfp_mask, order);
}
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
			struct vm_area_struct *vma, unsigned long addr,
			int node, bool hugepage);
#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#else
#define alloc_pages(gfp_mask, order) \
		alloc_pages_node(numa_node_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
	alloc_pages(gfp_mask, order)
#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr)			\
	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)

extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
					  unsigned int order);

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);

void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
void free_pages_exact(void *virt, size_t size);
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);

#define __get_free_page(gfp_mask) \
		__get_free_pages((gfp_mask), 0)

#define __get_dma_pages(gfp_mask, order) \
		__get_free_pages((gfp_mask) | GFP_DMA, (order))

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, bool cold);
extern void free_hot_cold_page_list(struct list_head *list, bool cold);

struct page_frag_cache;
extern void *__alloc_page_frag(struct page_frag_cache *nc,
			       unsigned int fragsz, gfp_t gfp_mask);
extern void __free_page_frag(void *addr);

extern void __free_kmem_pages(struct page *page, unsigned int order);
extern void free_kmem_pages(unsigned long addr, unsigned int order);

#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)

void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);

void page_alloc_init_late(void);

/*
 * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
 * GFP flags are used before interrupts are enabled. Once interrupts are
 * enabled, it is set to __GFP_BITS_MASK while the system is running. During
 * hibernation, it is used by PM to avoid I/O during memory allocation while
 * devices are suspended.
 */
extern gfp_t gfp_allowed_mask;

/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);

extern void pm_restrict_gfp_mask(void);
extern void pm_restore_gfp_mask(void);

#ifdef CONFIG_PM_SLEEP
extern bool pm_suspended_storage(void);
#else
static inline bool pm_suspended_storage(void)
{
	return false;
}
#endif /* CONFIG_PM_SLEEP */

#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
/* The below functions must be run on a range from a single zone. */
extern int alloc_contig_range(unsigned long start, unsigned long end,
			      unsigned migratetype);
extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
#endif

#ifdef CONFIG_CMA
/* CMA stuff */
extern void init_cma_reserved_pageblock(struct page *page);
#endif

#endif /* __LINUX_GFP_H */
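
The GFP_ZONE_TABLE lookup done by gfp_zone() in the header above can be tried outside the kernel. The following is a user-space sketch only: it assumes ZONES_SHIFT is 2 and a zone numbering without ZONE_HIGHMEM (DMA=0, DMA32=1, NORMAL=2, MOVABLE=3), so the HIGHMEM entry falls back to NORMAL. All S_ and sketch_ names are hypothetical; only the four zone modifier bit values are taken from the header.

```c
/* Zone modifier bits, copied from the ___GFP_* values in the header. */
#define S_GFP_DMA	0x01u
#define S_GFP_HIGHMEM	0x02u
#define S_GFP_DMA32	0x04u
#define S_GFP_MOVABLE	0x08u
#define S_GFP_ZONEMASK	(S_GFP_DMA | S_GFP_HIGHMEM | S_GFP_DMA32 | S_GFP_MOVABLE)

/* Assumed zone numbering for this sketch: no ZONE_HIGHMEM configured. */
enum sketch_zone { S_ZONE_DMA, S_ZONE_DMA32, S_ZONE_NORMAL, S_ZONE_MOVABLE };
#define S_ZONES_SHIFT		2	/* assumed: four zones fit in two bits */
#define S_OPT_ZONE_HIGHMEM	S_ZONE_NORMAL

/* One S_ZONES_SHIFT-wide entry per permitted combination of the low bits. */
#define S_ZONE_TABLE ( \
	  ((unsigned long)S_ZONE_NORMAL << 0 * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_DMA << S_GFP_DMA * S_ZONES_SHIFT) \
	| ((unsigned long)S_OPT_ZONE_HIGHMEM << S_GFP_HIGHMEM * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_DMA32 << S_GFP_DMA32 * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_NORMAL << S_GFP_MOVABLE * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_DMA << (S_GFP_MOVABLE | S_GFP_DMA) * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_MOVABLE << (S_GFP_MOVABLE | S_GFP_HIGHMEM) * S_ZONES_SHIFT) \
	| ((unsigned long)S_ZONE_DMA32 << (S_GFP_MOVABLE | S_GFP_DMA32) * S_ZONES_SHIFT) \
)

/* One bit per forbidden combination, indexed by the low four bits. */
#define S_ZONE_BAD ( \
	  1u << (S_GFP_DMA | S_GFP_HIGHMEM) \
	| 1u << (S_GFP_DMA | S_GFP_DMA32) \
	| 1u << (S_GFP_DMA32 | S_GFP_HIGHMEM) \
	| 1u << (S_GFP_DMA | S_GFP_DMA32 | S_GFP_HIGHMEM) \
	| 1u << (S_GFP_MOVABLE | S_GFP_HIGHMEM | S_GFP_DMA) \
	| 1u << (S_GFP_MOVABLE | S_GFP_DMA32 | S_GFP_DMA) \
	| 1u << (S_GFP_MOVABLE | S_GFP_DMA32 | S_GFP_HIGHMEM) \
	| 1u << (S_GFP_MOVABLE | S_GFP_DMA32 | S_GFP_DMA | S_GFP_HIGHMEM) \
)

/* Shift the table right by (index * entry width) and mask one entry. */
static enum sketch_zone sketch_gfp_zone(unsigned int flags)
{
	unsigned int bit = flags & S_GFP_ZONEMASK;

	return (enum sketch_zone)((S_ZONE_TABLE >> (bit * S_ZONES_SHIFT)) &
				  ((1UL << S_ZONES_SHIFT) - 1));
}

/* Nonzero if the zone modifier combination is one the table forbids. */
static int sketch_gfp_zone_is_bad(unsigned int flags)
{
	unsigned int bit = flags & S_GFP_ZONEMASK;

	return (int)((S_ZONE_BAD >> bit) & 1u);
}
```

Packing the table into a single word keeps the lookup to a shift and a mask. Forbidden combinations are left as zero entries, which is why the header checks GFP_ZONE_BAD separately instead of relying on the table value.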