// SPDX-License-Identifier: GPL-2.0-only
/*
 *  linux/mm/swap.c
 *
 *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
 */

/*
 * This file contains the default values for the operation of the
 * Linux VM subsystem. Fine-tuning documentation can be found in
 * Documentation/admin-guide/sysctl/vm.rst.
 * Started 18.12.91
 * Swap aging added 23.2.95, Stephen Tweedie.
 * Buffermem limits added 12.3.98, Rik van Riel.
 */

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/init.h>
#include <linux/export.h>
#include <linux/mm_inline.h>
#include <linux/percpu_counter.h>
#include <linux/memremap.h>
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/gfp.h>
#include <linux/uio.h>
#include <linux/hugetlb.h>
mm: introduce idle page tracking
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 06:35:45 +08:00
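To make the bitmap addressing concrete, here is a minimal userspace sketch of how a page frame number could be mapped to a position in /sys/kernel/mm/page_idle/bitmap. It assumes the layout described in the kernel's idle page tracking documentation (an array of 8-byte words in native byte order, with PFN i stored at bit i % 64 of word i / 64), which the commit message above does not spell out. The struct and helper names are hypothetical and not part of any kernel API.

```c
#include <stdint.h>

/* Hypothetical helper type: where one page's Idle bit lives in the bitmap file. */
struct idle_bit_pos {
	uint64_t byte_offset;	/* file offset of the 64-bit word holding the bit */
	unsigned int bit;	/* bit index inside that word, 0..63 */
};

/* Map a page frame number to its word offset and bit index (assumed layout). */
static struct idle_bit_pos idle_bit_for_pfn(uint64_t pfn)
{
	struct idle_bit_pos pos;

	pos.byte_offset = (pfn / 64) * sizeof(uint64_t);
	pos.bit = (unsigned int)(pfn % 64);
	return pos;
}

/* Mask to OR into the word read at byte_offset in order to mark the page idle. */
static uint64_t idle_mask_for_pfn(uint64_t pfn)
{
	return UINT64_C(1) << (pfn % 64);
}
```

Under these assumptions, PFN 130 falls in the third word (byte offset 16) at bit 2. A tool would read or write that whole 8-byte word at the computed offset rather than touching single bytes.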
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>

#include "internal.h"

#define CREATE_TRACE_POINTS
#include <trace/events/pagemap.h>

/* How many pages do we try to swap or page in/out together? */
int page_cluster;

/* Protecting only lru_rotate.pvec which requires disabling interrupts */
struct lru_rotate {
	local_lock_t lock;
	struct pagevec pvec;
};
static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/*
 * The following struct pagevec are grouped together because they are protected
 * by disabling preemption (and interrupts remain enabled).
 */
struct lru_pvecs {
	local_lock_t lock;
	struct pagevec lru_add;
	struct pagevec lru_deactivate_file;
	struct pagevec lru_deactivate;
	struct pagevec lru_lazyfree;
#ifdef CONFIG_SMP
	struct pagevec activate_page;
#endif
};
static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
	.lock = INIT_LOCAL_LOCK(lock),
};
mm: use pagevec to rotate reclaimable page
While running some memory intensive load, system response deteriorated just
after swap-out started.
The cause of this problem is that when a PG_reclaim page is moved to the tail
of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock is
acquired on every page writeback. This degrades system performance and
lengthens interrupt hold-off time once swap-out has started.
The following patch solves this problem. It uses a pagevec when rotating reclaimable
pages to mitigate LRU spin lock contention and reduce interrupt hold-off time.
I ran a test that allocates and touches pages in multiple processes while
flood-pinging the test machine to measure response under memory-intensive
load.
The test result is:
-2.6.23-rc5
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma
17.746/0.092 ms
-2.6.23-rc5-patched
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
17.314/0.091 ms
Max round-trip-time was improved.
The test machine has 4 CPUs (3.16GHz, Hyper-threading enabled),
8GB of memory and 8GB of swap.
I did ping test again to observe performance deterioration caused by taking
a ref.
-2.6.23-rc6-with-modifiedpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms
The result for my original patch is as follows.
-2.6.23-rc5-with-originalpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
The influence on response was small.
[akpm@linux-foundation.org: fix uninitalised var warning]
[hugh@veritas.com: fix locking]
[randy.dunlap@oracle.com: fix function declaration]
[hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
[hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
[hugh@veritas.com: move_tail_pages into lru_add_drain]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 16:24:52 +08:00
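The batching idea in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the lock is replaced by a counter, BATCH_SIZE is an arbitrary illustrative value, and every name here (struct batch, struct lru_model, rotate_one, rotate_batched, batch_drain) is hypothetical rather than kernel code.

```c
#include <stddef.h>

#define BATCH_SIZE 15	/* illustrative capacity, not taken from the source */

/* A small fixed-size buffer of pending items, standing in for a pagevec. */
struct batch {
	size_t nr;
	int items[BATCH_SIZE];
};

/* Counters that stand in for the LRU list and its lock. */
struct lru_model {
	unsigned long lock_acquisitions;	/* how often the "lock" was taken */
	unsigned long moved;			/* items moved to the list tail */
};

/* Unbatched path: one lock round trip for every item. */
static void rotate_one(struct lru_model *lru, int item)
{
	(void)item;
	lru->lock_acquisitions++;	/* models lock + unlock around one move */
	lru->moved++;
}

/* Move every queued item under a single lock round trip. */
static void batch_drain(struct lru_model *lru, struct batch *b)
{
	if (b->nr == 0)
		return;
	lru->lock_acquisitions++;
	lru->moved += b->nr;
	b->nr = 0;
}

/* Batched path: queue the item and drain only when the batch is full. */
static void rotate_batched(struct lru_model *lru, struct batch *b, int item)
{
	b->items[b->nr++] = item;
	if (b->nr == BATCH_SIZE)
		batch_drain(lru, b);
}
```

In this model, rotating 3000 items takes the lock 3000 times on the unbatched path and 200 times on the batched path. The real code additionally has to hold a reference on each queued page and drain partially filled vectors, which the model leaves out.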
/*
 * This path almost never happens for VM activity - pages are normally freed
 * via pagevecs.  But it gets used by networking - and for compound pages.
 */
static void __page_cache_release(struct page *page)
{
	if (PageLRU(page)) {
		struct folio *folio = page_folio(page);
		struct lruvec *lruvec;
		unsigned long flags;

		lruvec = folio_lruvec_lock_irqsave(folio, &flags);
		del_page_from_lru_list(page, lruvec);
		__clear_page_lru_flags(page);
		unlock_page_lruvec_irqrestore(lruvec, flags);
	}
	/* See comment on PageMlocked in release_pages() */
	if (unlikely(PageMlocked(page))) {
		int nr_pages = thp_nr_pages(page);

		__ClearPageMlocked(page);
		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
		count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
	}
	__ClearPageWaiters(page);
}

static void __put_single_page(struct page *page)
{
	__page_cache_release(page);
	mem_cgroup_uncharge(page_folio(page));
	free_unref_page(page, 0);
}

static void __put_compound_page(struct page *page)
{
	/*
	 * __page_cache_release() is supposed to be called for thp, not for
	 * hugetlb. This is because hugetlb page does never have PageLRU set
	 * (it's never listed to any LRU lists) and no memcg routines should
	 * be called for hugetlb (it has a separate hugetlb_cgroup.)
	 */
	if (!PageHuge(page))
		__page_cache_release(page);
	destroy_compound_page(page);
}

void __put_page(struct page *page)
{
	if (is_zone_device_page(page)) {
		put_dev_pagemap(page->pgmap);

		/*
		 * The page belongs to the device that created pgmap. Do
		 * not return it to page allocator.
		 */
		return;
	}

	if (unlikely(PageCompound(page)))
		__put_compound_page(page);
	else
		__put_single_page(page);
}
EXPORT_SYMBOL(__put_page);

/**
 * put_pages_list() - release a list of pages
 * @pages: list of pages threaded on page->lru
 *
 * Release a list of pages which are strung together on page.lru.
 */
void put_pages_list(struct list_head *pages)
{
	struct page *page, *next;

	list_for_each_entry_safe(page, next, pages, lru) {
		if (!put_page_testzero(page)) {
			list_del(&page->lru);
			continue;
		}
		if (PageHead(page)) {
			list_del(&page->lru);
			__put_compound_page(page);
			continue;
		}
		/* Cannot be PageLRU because it's passed to us using the lru */
		__ClearPageWaiters(page);
	}

	free_unref_page_list(pages);
	INIT_LIST_HEAD(pages);
}
EXPORT_SYMBOL(put_pages_list);

/*
 * get_kernel_pages() - pin kernel pages in memory
 * @kiov:	An array of struct kvec structures
 * @nr_segs:	number of segments to pin
 * @write:	pinning for read/write, currently ignored
 * @pages:	array that receives pointers to the pages pinned.
 *		Should be at least nr_segs long.
 *
 * Returns number of pages pinned. This may be fewer than the number
 * requested. If nr_segs is 0 or negative, returns 0. If no pages
 * were pinned, returns -errno. Each page returned must be released
 * with a put_page() call when it is finished with.
 */
int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write,
		struct page **pages)
{
	int seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE))
			return seg;

		pages[seg] = kmap_to_page(kiov[seg].iov_base);
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
		get_page(pages[seg]);
	}

	return seg;
}
EXPORT_SYMBOL_GPL(get_kernel_pages);

static void pagevec_lru_move_fn(struct pagevec *pvec,
	void (*move_fn)(struct page *page, struct lruvec *lruvec))
{
	int i;
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct folio *folio = page_folio(page);

		/* block memcg migration during page moving between lru */
		if (!TestClearPageLRU(page))
			continue;

		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		(*move_fn)(page, lruvec);

		SetPageLRU(page);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
	release_pages(pvec->pages, pvec->nr);
	pagevec_reinit(pvec);
}

static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
	struct folio *folio = page_folio(page);

	if (!folio_test_unevictable(folio)) {
		lruvec_del_folio(lruvec, folio);
		folio_clear_active(folio);
		lruvec_add_folio_tail(lruvec, folio);
		__count_vm_events(PGROTATED, folio_nr_pages(folio));
	}
}

mm: disable LRU pagevec during the migration temporarily
LRU pagevec holds refcount of pages until the pagevec are drained. It
could prevent migration since the refcount of the page is greater than
the expectation in migration logic. To mitigate the issue, callers of
migrate_pages drain LRU pagevec via migrate_prep or lru_add_drain_all
before migrate_pages call.
However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trial
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagevec and migration so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables lru caches (i.e., pagevec) during
ongoing migration until migrate is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a sync
migration) with below debug code.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test was repeating android apps launching with cma allocation in
background every five seconds. Total cma allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:36:54 +08:00
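The effect of disabling the per-CPU cache can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: VEC_SIZE, struct vec, cache_disable_count and the helper names are hypothetical, and the disable state is a plain integer here rather than whatever synchronisation the real code uses. The point it illustrates is that once the cache is disabled, every add asks the caller to flush at once, so no page lingers in the vector holding an extra reference.

```c
#include <stdbool.h>
#include <stddef.h>

#define VEC_SIZE 15	/* illustrative capacity, not taken from the source */

/* Small fixed-size buffer standing in for a pagevec. */
struct vec {
	size_t nr;
	const void *items[VEC_SIZE];
};

/* Hypothetical stand-in for the "LRU cache disabled" state. */
static int cache_disable_count;

static bool cache_disabled(void)
{
	return cache_disable_count > 0;
}

/* Append an item and return how many free slots remain afterwards. */
static size_t vec_add(struct vec *v, const void *item)
{
	v->items[v->nr++] = item;
	return VEC_SIZE - v->nr;
}

/*
 * Return true when the caller should drain the vector now: it is full,
 * the item is a large (compound) one, or caching is currently disabled.
 * The caller must drain before adding again whenever this returns true.
 */
static bool vec_add_and_need_flush(struct vec *v, const void *item,
				   bool is_compound)
{
	return !vec_add(v, item) || is_compound || cache_disabled();
}
```

A migration path modelled on this sketch would raise cache_disable_count before it starts and lower it when done, so that adds made in between are flushed immediately instead of being batched.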
/* return true if pagevec needs to drain */
static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
{
	bool ret = false;

	if (!pagevec_add(pvec, page) || PageCompound(page) ||
			lru_cache_disabled())
		ret = true;

	return ret;
}

/*
 * Writeback is about to end against a folio which has been marked for
 * immediate reclaim.  If it still appears to be reclaimable, move it
 * to the tail of the inactive list.
 *
 * folio_rotate_reclaimable() must disable IRQs, to prevent nasty races.
 */
void folio_rotate_reclaimable(struct folio *folio)
{
	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
		struct pagevec *pvec;
		unsigned long flags;

		folio_get(folio);
		local_lock_irqsave(&lru_rotate.lock, flags);
		pvec = this_cpu_ptr(&lru_rotate.pvec);
		if (pagevec_add_and_need_flush(pvec, &folio->page))
			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
		local_unlock_irqrestore(&lru_rotate.lock, flags);
	}
}

mm: vmscan: reclaim writepage is IO cost
The VM tries to balance reclaim pressure between anon and file so as to
reduce the amount of IO incurred due to the memory shortage. It already
counts refaults and swapins, but in addition it should also count
writepage calls during reclaim.
For swap, this is obvious: it's IO that wouldn't have occurred if the
anonymous memory hadn't been under memory pressure. From a relative
balancing point of view this makes sense as well: even if anon is cold and
reclaimable, a cache that isn't thrashing may have equally cold pages that
don't require IO to reclaim.
For file writeback, it's trickier: some of the reclaim writepage IO would
have likely occurred anyway due to dirty expiration. But not all of it -
premature writeback reduces batching and generates additional writes.
Since the flushers are already woken up by the time the VM starts writing
cache pages one by one, let's assume that we're likely causing writes that
wouldn't have happened without memory pressure. In addition, the per-page
cost of IO would have probably been much cheaper if written in larger
batches from the flusher thread rather than the single-page-writes from
kswapd.
For our purposes - getting the trend right to accelerate convergence on a
stable state that doesn't require paging at all - this is sufficiently
accurate. If we later wanted to optimize for sustained thrashing, we can
still refine the measurements.
Count all writepage calls from kswapd as IO cost toward the LRU that the
page belongs to.
Why do this dynamically? Don't we know in advance that anon pages require
IO to reclaim, and so could build in a static bias?
First, scanning is not the same as reclaiming. If all the anon pages are
referenced, we may not swap for a while just because we're scanning the
anon list. During this time, however, it's important that we age
anonymous memory and the page cache at the same rate so that their
hot-cold gradients are comparable. Everything else being equal, we still
want to reclaim the coldest memory overall.
Second, we keep copies in swap unless the page changes. If there is
swap-backed data that's mostly read (tmpfs file) and has been swapped out
before, we can reclaim it without incurring additional IO.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:03:09 +08:00
|
|
|
void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
|
2009-01-08 10:08:20 +08:00
|
|
|
{
|
mm: vmscan: determine anon/file pressure balance at the reclaim root
We split the LRU lists into anon and file, and we rebalance the scan
pressure between them when one of them begins thrashing: if the file cache
experiences workingset refaults, we increase the pressure on anonymous
pages; if the workload is stalled on swapins, we increase the pressure on
the file cache instead.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, LRU pressure balancing is
done on an individual cgroup LRU level. As a result, when one cgroup is
thrashing on the filesystem cache while a sibling may have cold anonymous
pages, pressure doesn't get equalized between them.
This patch moves LRU balancing decision to the root of reclaim - the same
level where the LRU order is established.
It does this by tracking LRU cost recursively, so that every level of the
cgroup tree knows the aggregate LRU cost of all memory within its domain.
When the page scanner calculates the scan balance for any given individual
cgroup's LRU list, it uses the values from the ancestor cgroup that
initiated the reclaim cycle.
If one sibling is then thrashing on the cache, it will tip the pressure
balance inside its ancestors, and the next hierarchical reclaim iteration
will go more after the anon pages in the tree.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:03:06 +08:00
|
|
|
do {
|
|
|
|
unsigned long lrusize;
|
|
|
|
|
2020-12-16 04:34:29 +08:00
|
|
|
/*
|
|
|
|
* Holding lruvec->lru_lock is safe here, since
|
|
|
|
* 1) The pinned lruvec in reclaim, or
|
|
|
|
* 2) From a pre-LRU page during refault (which also holds the
|
|
|
|
* rcu lock, so would be safe even if the page was on the LRU
|
|
|
|
* and could move simultaneously to a new lruvec).
|
|
|
|
*/
|
|
|
|
spin_lock_irq(&lruvec->lru_lock);
|
2020-06-04 07:03:06 +08:00
|
|
|
/* Record cost event */
|
2020-06-04 07:03:09 +08:00
|
|
|
if (file)
|
|
|
|
lruvec->file_cost += nr_pages;
|
2020-06-04 07:03:06 +08:00
|
|
|
else
|
2020-06-04 07:03:09 +08:00
|
|
|
lruvec->anon_cost += nr_pages;
|
2020-06-04 07:03:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Decay previous events
|
|
|
|
*
|
|
|
|
* Because workloads change over time (and to avoid
|
|
|
|
* overflow) we keep these statistics as a floating
|
|
|
|
* average, which ends up weighting recent refaults
|
|
|
|
* more than old ones.
|
|
|
|
*/
|
|
|
|
lrusize = lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
|
|
|
|
lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
|
|
|
|
lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
|
|
|
|
lruvec_page_state(lruvec, NR_ACTIVE_FILE);
|
|
|
|
|
|
|
|
if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) {
|
|
|
|
lruvec->file_cost /= 2;
|
|
|
|
lruvec->anon_cost /= 2;
|
|
|
|
}
|
2020-12-16 04:34:29 +08:00
|
|
|
spin_unlock_irq(&lruvec->lru_lock);
|
2020-06-04 07:03:06 +08:00
|
|
|
} while ((lruvec = parent_lruvec(lruvec)));
|
2009-01-08 10:08:20 +08:00
|
|
|
}
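The cost accounting in lru_note_cost() amounts to a decaying counter that is propagated up a hierarchy. The following userspace sketch illustrates that arithmetic only. `struct cost_node` and `note_cost()` are hypothetical names, `lrusize` is assumed to be a fixed field rather than the sum of four LRU counters, and the lru_lock handling is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lruvec; only the cost fields are modelled. */
struct cost_node {
	unsigned long file_cost;
	unsigned long anon_cost;
	unsigned long lrusize;		/* assumed fixed; the kernel sums LRU counters */
	struct cost_node *parent;	/* NULL at the root of the hierarchy */
};

/*
 * Charge nr_pages of cost to the file or anon side of node and of every
 * ancestor. At each level, once the combined cost exceeds a quarter of
 * the LRU size, both counters are halved so that older events fade out.
 */
static void note_cost(struct cost_node *node, bool file, unsigned int nr_pages)
{
	for (; node; node = node->parent) {
		if (file)
			node->file_cost += nr_pages;
		else
			node->anon_cost += nr_pages;

		if (node->file_cost + node->anon_cost > node->lrusize / 4) {
			node->file_cost /= 2;
			node->anon_cost /= 2;
		}
	}
}
```

Because the decay threshold scales with each level's own LRU size, a small child can decay its counters while a larger ancestor still holds the undecayed totals.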
|
|
|
|
|
2021-04-29 22:27:16 +08:00
|
|
|
void lru_note_cost_folio(struct folio *folio)
|
2020-06-04 07:03:09 +08:00
|
|
|
{
|
2021-04-29 22:27:16 +08:00
|
|
|
lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio),
|
|
|
|
folio_nr_pages(folio));
|
2020-06-04 07:03:09 +08:00
|
|
|
}
|
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
static void __folio_activate(struct folio *folio, struct lruvec *lruvec)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-04-27 22:37:50 +08:00
|
|
|
if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
|
|
|
|
long nr_pages = folio_nr_pages(folio);
|
2011-01-14 07:47:34 +08:00
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
lruvec_del_folio(lruvec, folio);
|
|
|
|
folio_set_active(folio);
|
|
|
|
lruvec_add_folio(lruvec, folio);
|
|
|
|
trace_mm_lru_activate(folio);
|
2008-10-19 11:26:32 +08:00
|
|
|
|
2020-06-04 07:03:19 +08:00
|
|
|
__count_vm_events(PGACTIVATE, nr_pages);
|
|
|
|
__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE,
|
|
|
|
nr_pages);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2011-05-25 08:12:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_SMP
|
2021-04-27 22:37:50 +08:00
|
|
|
static void __activate_page(struct page *page, struct lruvec *lruvec)
|
|
|
|
{
|
|
|
|
return __folio_activate(page_folio(page), lruvec);
|
|
|
|
}
|
|
|
|
|
2011-05-25 08:12:55 +08:00
|
|
|
static void activate_page_drain(int cpu)
|
|
|
|
{
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
|
2011-05-25 08:12:55 +08:00
|
|
|
|
|
|
|
if (pagevec_count(pvec))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, __activate_page);
|
2011-05-25 08:12:55 +08:00
|
|
|
}
|
|
|
|
|
2013-09-13 06:13:55 +08:00
|
|
|
static bool need_activate_page_drain(int cpu)
|
|
|
|
{
|
2020-05-28 04:11:15 +08:00
|
|
|
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
|
2013-09-13 06:13:55 +08:00
|
|
|
}
|
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
static void folio_activate(struct folio *folio)
|
2011-05-25 08:12:55 +08:00
|
|
|
{
|
2021-04-27 22:37:50 +08:00
|
|
|
if (folio_test_lru(folio) && !folio_test_active(folio) &&
|
|
|
|
!folio_test_unevictable(folio)) {
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec;
|
2011-05-25 08:12:55 +08:00
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
folio_get(folio);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.activate_page);
|
2021-04-27 22:37:50 +08:00
|
|
|
if (pagevec_add_and_need_flush(pvec, &folio->page))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, __activate_page);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock(&lru_pvecs.lock);
|
2011-05-25 08:12:55 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
static inline void activate_page_drain(int cpu)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
static void folio_activate(struct folio *folio)
|
2011-05-25 08:12:55 +08:00
|
|
|
{
|
2020-12-16 04:34:29 +08:00
|
|
|
struct lruvec *lruvec;
|
2011-05-25 08:12:55 +08:00
|
|
|
|
2021-04-27 22:37:50 +08:00
|
|
|
if (folio_test_clear_lru(folio)) {
|
2021-06-29 09:59:47 +08:00
|
|
|
lruvec = folio_lruvec_lock_irq(folio);
|
2021-04-27 22:37:50 +08:00
|
|
|
__folio_activate(folio, lruvec);
|
2020-12-16 04:34:29 +08:00
|
|
|
unlock_page_lruvec_irq(lruvec);
|
2021-04-27 22:37:50 +08:00
|
|
|
folio_set_lru(folio);
|
2020-12-16 04:34:29 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2011-05-25 08:12:55 +08:00
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-04-27 22:47:39 +08:00
|
|
|
static void __lru_cache_activate_folio(struct folio *folio)
|
2013-07-04 06:02:30 +08:00
|
|
|
{
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec;
|
2013-07-04 06:02:30 +08:00
|
|
|
int i;
|
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
|
|
|
|
|
2013-07-04 06:02:30 +08:00
|
|
|
/*
|
|
|
|
* Search backwards on the optimistic assumption that the page being
|
|
|
|
* activated has just been added to this pagevec. Note that only
|
|
|
|
* the local pagevec is examined as a !PageLRU page could be in the
|
|
|
|
* process of being released, reclaimed, migrated or on a remote
|
|
|
|
* pagevec that is currently being drained. Furthermore, marking
|
|
|
|
* a remote pagevec's page PageActive potentially hits a race where
|
|
|
|
* a page is marked PageActive just after it is added to the inactive
|
|
|
|
* list causing accounting errors and BUG_ON checks to trigger.
|
|
|
|
*/
|
|
|
|
for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
|
|
|
|
struct page *pagevec_page = pvec->pages[i];
|
|
|
|
|
2021-04-27 22:47:39 +08:00
|
|
|
if (pagevec_page == &folio->page) {
|
|
|
|
folio_set_active(folio);
|
2013-07-04 06:02:30 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock(&lru_pvecs.lock);
|
2013-07-04 06:02:30 +08:00
|
|
|
}
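The backward search in __lru_cache_activate_folio() can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct entry`, `struct pending`, `PENDING_MAX` and `pending_activate()` are hypothetical names, the local lock is omitted, and the function returns a bool only so the outcome can be observed (the kernel function returns void).

```c
#include <stdbool.h>
#include <stddef.h>

#define PENDING_MAX 15	/* assumed capacity of the local batch */

/* Hypothetical stand-in for a folio; only the active flag is modelled. */
struct entry {
	bool active;
};

/* Hypothetical stand-in for the local lru_add pagevec. */
struct pending {
	size_t nr;
	struct entry *slots[PENDING_MAX];
};

/*
 * Scan the local batch from the newest slot to the oldest, on the
 * assumption that the target was added recently. Mark it active if it
 * is found; an entry that is not in this batch is left untouched.
 */
static bool pending_activate(struct pending *p, struct entry *target)
{
	for (size_t i = p->nr; i-- > 0; ) {
		if (p->slots[i] == target) {
			target->active = true;
			return true;
		}
	}
	return false;
}
```

Only the caller's own batch is searched, which mirrors the comment above: an entry sitting in some other batch is deliberately not modified.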
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Mark a page as having seen activity.
|
|
|
|
*
|
|
|
|
* inactive,unreferenced -> inactive,referenced
|
|
|
|
* inactive,referenced -> active,unreferenced
|
|
|
|
* active,unreferenced -> active,referenced
|
2014-08-07 07:06:43 +08:00
|
|
|
*
|
|
|
|
* When a newly allocated page is not yet visible, so safe for non-atomic ops,
|
|
|
|
* __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2021-04-27 22:47:39 +08:00
|
|
|
void folio_mark_accessed(struct folio *folio)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-04-27 22:47:39 +08:00
|
|
|
if (!folio_test_referenced(folio)) {
|
|
|
|
folio_set_referenced(folio);
|
|
|
|
} else if (folio_test_unevictable(folio)) {
|
2019-12-01 09:50:00 +08:00
|
|
|
/*
|
|
|
|
* Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
|
|
|
|
* this list is never rotated or maintained, so marking an
|
|
|
|
* unevictable page accessed has no effect.
|
|
|
|
*/
|
2021-04-27 22:47:39 +08:00
|
|
|
} else if (!folio_test_active(folio)) {
|
2013-07-04 06:02:30 +08:00
|
|
|
/*
|
|
|
|
* If the page is on the LRU, queue it for activation via
|
2020-05-28 04:11:15 +08:00
|
|
|
* lru_pvecs.activate_page. Otherwise, assume the page is on a
|
2013-07-04 06:02:30 +08:00
|
|
|
* pagevec, mark it active and it'll be moved to the active
|
|
|
|
* LRU on the next drain.
|
|
|
|
*/
|
2021-04-27 22:47:39 +08:00
|
|
|
if (folio_test_lru(folio))
|
|
|
|
folio_activate(folio);
|
2013-07-04 06:02:30 +08:00
|
|
|
else
|
2021-04-27 22:47:39 +08:00
|
|
|
__lru_cache_activate_folio(folio);
|
|
|
|
folio_clear_referenced(folio);
|
|
|
|
workingset_activation(folio);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2021-04-27 22:47:39 +08:00
|
|
|
if (folio_test_idle(folio))
|
|
|
|
folio_clear_idle(folio);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2021-04-27 22:47:39 +08:00
|
|
|
EXPORT_SYMBOL(folio_mark_accessed);
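The transitions listed in the comment above folio_mark_accessed() form a small state machine over the referenced and active bits. The sketch below models only those bits. `struct lru_state` and `mark_accessed()` are hypothetical names, and the pagevec queuing, workingset accounting and idle-flag handling are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the folio flags that matter here. */
struct lru_state {
	bool referenced;
	bool active;
	bool unevictable;
};

/*
 * One access moves the state a single step:
 *   inactive,unreferenced -> inactive,referenced
 *   inactive,referenced   -> active,unreferenced
 *   active,unreferenced   -> active,referenced
 * Unevictable entries gain the referenced bit but are never activated.
 */
static void mark_accessed(struct lru_state *s)
{
	if (!s->referenced) {
		s->referenced = true;
	} else if (s->unevictable) {
		/* The unevictable list is never rotated; nothing to do. */
	} else if (!s->active) {
		s->active = true;
		s->referenced = false;
	}
}
```

In this model an entry needs two accesses to become active and a third to be marked referenced again, after which further accesses change nothing.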
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-10-19 11:26:19 +08:00
|
|
|
/**
|
2021-04-29 23:09:31 +08:00
|
|
|
* folio_add_lru - Add a folio to an LRU list.
|
|
|
|
* @folio: The folio to be added to the LRU.
|
2014-06-05 07:07:31 +08:00
|
|
|
*
|
2021-04-29 23:09:31 +08:00
|
|
|
* Queue the folio for addition to the LRU. The decision on whether
|
2014-06-05 07:07:31 +08:00
|
|
|
* to add the page to the [in]active [file|anon] list is deferred until the
|
2021-04-29 23:09:31 +08:00
|
|
|
* pagevec is drained. This gives a chance for the caller of folio_add_lru()
|
|
|
|
* to have the folio added to the active list using folio_mark_accessed().
|
2008-10-19 11:26:19 +08:00
|
|
|
*/
|
2021-04-29 23:09:31 +08:00
|
|
|
void folio_add_lru(struct folio *folio)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-06-04 07:02:40 +08:00
|
|
|
struct pagevec *pvec;
|
|
|
|
|
2021-04-29 23:09:31 +08:00
|
|
|
VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
|
|
|
|
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
|
2020-06-04 07:02:40 +08:00
|
|
|
|
2021-04-29 23:09:31 +08:00
|
|
|
folio_get(folio);
|
2020-06-04 07:02:40 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
|
2021-04-29 23:09:31 +08:00
|
|
|
if (pagevec_add_and_need_flush(pvec, &folio->page))
|
2020-06-04 07:02:40 +08:00
|
|
|
__pagevec_lru_add(pvec);
|
|
|
|
local_unlock(&lru_pvecs.lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2021-04-29 23:09:31 +08:00
|
|
|
EXPORT_SYMBOL(folio_add_lru);
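The point of queueing on a per-CPU pagevec is that the LRU lock is taken once per batch rather than once per folio. The sketch below models only that batching idea in userspace. All `toy_` names are hypothetical, the batch size of 15 is an assumption for illustration, and the real flush helper may also decide to drain for reasons not modelled here.

```c
#include <stddef.h>

#define TOY_PVEC_SIZE 15	/* assumed batch size, for illustration only */

/* Hypothetical per-CPU batch of pending additions. */
struct toy_pvec {
	size_t nr;
	int items[TOY_PVEC_SIZE];
};

/* Hypothetical LRU list; only counts what a real list would hold. */
struct toy_lru {
	size_t nr_added;	/* entries moved onto the list so far */
	size_t nr_drains;	/* how often the list lock would be taken */
};

/* Move every queued entry to the list in one go, then reset the batch. */
static void toy_drain(struct toy_pvec *pvec, struct toy_lru *lru)
{
	lru->nr_added += pvec->nr;
	lru->nr_drains++;
	pvec->nr = 0;
}

/* Queue one entry; drain only when the batch has no space left. */
static void toy_add(struct toy_pvec *pvec, struct toy_lru *lru, int item)
{
	pvec->items[pvec->nr++] = item;
	if (pvec->nr == TOY_PVEC_SIZE)
		toy_drain(pvec, lru);
}
```

With this model, adding 40 entries drains twice and leaves 10 entries queued, which is also why a freshly added folio may not be on any LRU list yet when folio_mark_accessed() sees it.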
|
2005-04-17 06:20:36 +08:00
|
|
|
|
mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:20 +08:00
|
|
|
/**
|
2020-08-12 09:30:40 +08:00
|
|
|
* lru_cache_add_inactive_or_unevictable
|
2014-08-09 05:19:20 +08:00
|
|
|
* @page: the page to be added to LRU
|
|
|
|
* @vma: vma in which page is mapped for determining reclaimability
|
|
|
|
*
|
2020-08-12 09:30:40 +08:00
|
|
|
* Place @page on the inactive or unevictable LRU list, depending on its
|
2020-10-14 07:52:24 +08:00
|
|
|
* evictability.
|
2014-08-09 05:19:20 +08:00
|
|
|
*/
|
2020-08-12 09:30:40 +08:00
|
|
|
void lru_cache_add_inactive_or_unevictable(struct page *page,
|
2014-08-09 05:19:20 +08:00
|
|
|
struct vm_area_struct *vma)
|
|
|
|
{
|
2020-08-12 09:30:40 +08:00
|
|
|
bool unevictable;
|
|
|
|
|
2014-08-09 05:19:20 +08:00
|
|
|
VM_BUG_ON_PAGE(PageLRU(page), page);
|
|
|
|
|
2020-08-12 09:30:40 +08:00
|
|
|
unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
|
|
|
|
if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
|
2020-09-19 12:20:15 +08:00
|
|
|
int nr_pages = thp_nr_pages(page);
|
2022-02-15 10:28:05 +08:00
|
|
|
|
|
|
|
mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
|
2020-09-19 12:20:15 +08:00
|
|
|
count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
|
2014-08-09 05:19:20 +08:00
|
|
|
}
|
mm, mlock, vmscan: no more skipping pagevecs
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec
of a different CPU. Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.
The page will eventually go to right LRU on reclaim but the LRU stats
will remain skewed for a long time.
This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability. This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU. Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.
However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().
#0: __pagevec_lru_add_fn            #1: clear_page_mlock

SetPageLRU()                        if (!TestClearPageMlocked())
                                      return
smp_mb() // <--required
                                    // inside does PageLRU
if (!PageMlocked())                 if (isolate_lru_page())
  move to evictable LRU               putback_lru_page()
else
  move to unevictable LRU
In '#1', TestClearPageMlocked() provides full memory barrier semantics
and thus the PageLRU check (inside isolate_lru_page) can not be
reordered before it.
In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU(). If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page. That page will be stranded on the unevictable
LRU.
There is one (good) side effect though. Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment. This patch will correctly
put such pages to unevictable LRU.
Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22 06:45:28 +08:00
|
|
|
lru_cache_add(page);
|
2014-08-09 05:19:20 +08:00
|
|
|
}
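The `unevictable` test in lru_cache_add_inactive_or_unevictable() masks two sets of bits and compares the result against VM_LOCKED alone, so it is true only when the VMA is locked and none of the special bits are set. A minimal sketch of that idiom follows. The `TOY_VM_*` values are made up for illustration and are not the kernel's constants, and the special mask here covers only two example bits.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real VM_* constants differ. */
#define TOY_VM_LOCKED	0x1UL
#define TOY_VM_IO	0x2UL
#define TOY_VM_PFNMAP	0x4UL
#define TOY_VM_SPECIAL	(TOY_VM_IO | TOY_VM_PFNMAP)

/*
 * True only if the locked bit is set and no special bit is set:
 * masking keeps the locked and special bits, and the comparison
 * requires that exactly the locked bit survives.
 */
static bool toy_vma_unevictable(unsigned long vm_flags)
{
	return (vm_flags & (TOY_VM_LOCKED | TOY_VM_SPECIAL)) == TOY_VM_LOCKED;
}
```

A single mask-and-compare replaces the longer form `(flags & LOCKED) && !(flags & SPECIAL)`; both give the same result for every input.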
|
|
|
|
|
2011-03-23 07:32:52 +08:00
|
|
|
/*
|
|
|
|
* If the page cannot be invalidated, it is moved to the
|
|
|
|
* inactive list to speed up its reclaim. It is moved to the
|
|
|
|
* head of the list, rather than the tail, to give the flusher
|
|
|
|
* threads some time to write it out, as this is much more
|
|
|
|
* effective than the single-page writeout from reclaim.
|
mm: reclaim invalidated page ASAP
invalidate_mapping_pages is very big hint to reclaimer. It means user
doesn't want to use the page any more. So in order to prevent working set
page eviction, this patch move the page into tail of inactive list by
PG_reclaim.
Please, remember that pages in inactive list are working set as well as
active list. If we don't move pages into inactive list's tail, pages near
by tail of inactive list can be evicted although we have a big clue about
useless pages. It's totally bad.
Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim
into clear_page_dirty_for_io for preventing fast reclaiming readahead
marker page.
In this series, PG_reclaim is used by invalidated page, too. If VM find
the page is invalidated and it's dirty, it sets PG_reclaim to reclaim
asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally. It
disturbs this serie's goal.
I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty has a problem which is reported by
Steven Barrett. It's due to compound page. Some driver(ex, audio) calls
set_page_dirty with compound page which isn't on LRU. but my patch does
ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it
breaks PageTail flag.
I think it doesn't affect THP and pass my test with THP enabling but Cced
Andrea for double check.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Steven Barrett <damentz@liquorix.net>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 07:32:54 +08:00
|
|
|
*
|
|
|
|
* If the page isn't page_mapped and is dirty/writeback, the page
|
|
|
|
* can be reclaimed ASAP using PG_reclaim.
|
|
|
|
*
|
|
|
|
* 1. active, mapped page -> none
|
|
|
|
* 2. active, dirty/writeback page -> inactive, head, PG_reclaim
|
|
|
|
* 3. inactive, mapped page -> none
|
|
|
|
* 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
|
|
|
|
* 5. inactive, clean -> inactive, tail
|
|
|
|
* 6. Others -> none
|
|
|
|
*
|
|
|
|
* In 4, the page moves to the inactive list's head because the VM expects
|
|
|
|
* it to be written out by flusher threads, as this is much more effective
|
|
|
|
* than the single-page writeout from reclaim.
|
2011-03-23 07:32:52 +08:00
|
|
|
*/
|
2020-12-16 04:33:56 +08:00
|
|
|
static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
|
2011-03-23 07:32:52 +08:00
|
|
|
{
|
2021-02-25 04:08:25 +08:00
|
|
|
bool active = PageActive(page);
|
2020-08-15 08:30:37 +08:00
|
|
|
int nr_pages = thp_nr_pages(page);
|
2011-03-23 07:32:52 +08:00
|
|
|
|
2011-05-12 06:13:30 +08:00
|
|
|
if (PageUnevictable(page))
|
|
|
|
return;
|
|
|
|
|
2011-03-23 07:32:52 +08:00
|
|
|
/* Some processes are using the page */
|
|
|
|
if (page_mapped(page))
|
|
|
|
return;
|
|
|
|
|
2021-02-25 04:08:25 +08:00
|
|
|
del_page_from_lru_list(page, lruvec);
|
2011-03-23 07:32:52 +08:00
|
|
|
ClearPageActive(page);
|
|
|
|
ClearPageReferenced(page);
|
|
|
|
|
2011-03-23 07:32:54 +08:00
|
|
|
if (PageWriteback(page) || PageDirty(page)) {
|
|
|
|
/*
|
|
|
|
* Setting PG_reclaim could race with end_page_writeback
|
|
|
|
* and confuse readahead. But the race window
|
|
|
|
* is _really_ small and it's a non-critical problem.
|
|
|
|
*/
|
2021-02-25 04:08:17 +08:00
|
|
|
add_page_to_lru_list(page, lruvec);
|
2011-03-23 07:32:54 +08:00
|
|
|
SetPageReclaim(page);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* The page's writeback ended while the page was in the pagevec.
|
2021-07-01 09:53:10 +08:00
|
|
|
* We move that page to the tail of the inactive list.
|
2011-03-23 07:32:54 +08:00
|
|
|
*/
|
2021-02-25 04:08:17 +08:00
|
|
|
add_page_to_lru_list_tail(page, lruvec);
|
2020-06-04 07:03:16 +08:00
|
|
|
__count_vm_events(PGROTATED, nr_pages);
|
2011-03-23 07:32:54 +08:00
|
|
|
}
|
|
|
|
|
2020-06-04 07:03:19 +08:00
|
|
|
if (active) {
|
2020-06-04 07:03:16 +08:00
|
|
|
__count_vm_events(PGDEACTIVATE, nr_pages);
|
2020-06-04 07:03:19 +08:00
|
|
|
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,
|
|
|
|
nr_pages);
|
|
|
|
}
|
2011-03-23 07:32:52 +08:00
|
|
|
}
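The decision table in the comment above lru_deactivate_file_fn() can be sketched as a small user-space model. This is only an illustration, not kernel code: `struct page_model`, `enum deactivate_result`, and `model_deactivate_file()` are hypothetical names, and the booleans stand in for PageUnevictable(), page_mapped(), PageActive(), and PageWriteback() || PageDirty(). The model follows the function body as shown, in which an active, clean, unmapped page also ends up on the inactive tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page state the kernel function inspects. */
struct page_model {
	bool unevictable;        /* models PageUnevictable(page) */
	bool mapped;             /* models page_mapped(page) */
	bool active;             /* models PageActive(page) */
	bool dirty_or_writeback; /* models PageWriteback() || PageDirty() */
};

/* Where the page ends up after the (modelled) deactivation attempt. */
enum deactivate_result {
	DEACT_NONE,          /* page is left where it is */
	DEACT_INACTIVE_HEAD, /* inactive list head, PG_reclaim set */
	DEACT_INACTIVE_TAIL, /* inactive list tail, counted as PGROTATED */
};

static enum deactivate_result
model_deactivate_file(const struct page_model *p)
{
	assert(p != NULL);

	/* Unevictable pages are never touched. */
	if (p->unevictable)
		return DEACT_NONE;

	/* Cases 1 and 3: some process still maps the page. */
	if (p->mapped)
		return DEACT_NONE;

	/* Cases 2 and 4: leave the writeout to the flusher threads. */
	if (p->dirty_or_writeback)
		return DEACT_INACTIVE_HEAD;

	/* Case 5: a clean page can be reclaimed right away. */
	return DEACT_INACTIVE_TAIL;
}
```

In the real function the `active` state does not change the destination list position; it only decides whether PGDEACTIVATE is counted afterwards, which is why the model carries the field but does not branch on it.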
|
|
|
|
|
2020-12-16 04:33:56 +08:00
|
|
|
static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
|
mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were the dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidates for swap. On investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed. (We measured the performance of swapping out
vs. zapping the memory by killing a process. Unsurprisingly, zapping is
10x faster even though we use zram, which is much faster than real
storage.) A kill from lmkd will therefore often satisfy the high zone
watermark, resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it gives the platform many opportunities to use the information it has to
optimize memory efficiency.
To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD, which deactivates active pages, and the other is
MADV_PAGEOUT, which reclaims private pages immediately. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in that it hints to the kernel that the memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in that it hints to the kernel that the memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for all LRU pages, like MADV_[DONTNEED|FREE]. In other words, it moves
active file page -> inactive file LRU
active anon page -> inactive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the
inactive file LRU, because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the pages under memory pressure because
their content is *garbage*, so freeing such pages has almost zero
overhead: we don't need to swap them out, and a later access causes just a
minor fault. Thus, it makes sense to put those freeable pages on the
inactive file LRU to compete with other used-once pages. It makes sense from
an implementation point of view, too, because the memory is no longer
swap-backed until it is re-dirtied. It even has the bonus that such pages
can be reclaimed on a swapless system. However, MADV_COLD doesn't mean
garbage, so reclaiming those pages requires swap-out/in in the end, which
costs more. Since we have designed VM LRU aging based on a cost model,
anonymous cold pages are better placed on the inactive anon LRU list, not
the file LRU. Furthermore, that helps to avoid unnecessary scanning if the
system doesn't have a swap device. Let's start with the simpler way without
adding complexity at this moment. However, keep in mind the caveat
that workloads with a lot of page cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 07:49:08 +08:00
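The MADV_COLD hint described in the commit message above can be exercised from user space roughly as follows. This is a minimal sketch, not part of mm/swap.c: `hint_range_cold()` is a hypothetical helper name, and the fallback definition of MADV_COLD as 20 assumes the generic UAPI value for builds against libc headers that predate the flag.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* assumption: generic UAPI value, for older headers */
#endif

/*
 * Tell the kernel that [addr, addr + len) is not expected to be used
 * soon. The contents are preserved; the pages only become earlier
 * candidates for reclaim under memory pressure.
 *
 * Returns 0 on success, or -1 with errno set. A kernel without
 * MADV_COLD support is expected to fail with EINVAL.
 */
static int hint_range_cold(void *addr, size_t len)
{
	assert(addr != NULL);
	return madvise(addr, len, MADV_COLD);
}
```

A caller would typically pass a page-aligned range obtained from mmap() and treat a failure as harmless, since the call is only a hint and the data stays intact either way.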
|
|
|
{
|
2020-12-16 04:34:25 +08:00
|
|
|
if (PageActive(page) && !PageUnevictable(page)) {
|
2020-08-15 08:30:37 +08:00
|
|
|
int nr_pages = thp_nr_pages(page);
|
2019-09-26 07:49:08 +08:00
|
|
|
|
2021-02-25 04:08:25 +08:00
|
|
|
del_page_from_lru_list(page, lruvec);
|
2019-09-26 07:49:08 +08:00
|
|
|
ClearPageActive(page);
|
|
|
|
ClearPageReferenced(page);
|
2021-02-25 04:08:17 +08:00
|
|
|
add_page_to_lru_list(page, lruvec);
|
mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process. Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for the platform to use its information to
optimize memory efficiency.
To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves
active file page -> inactive file LRU
active anon page -> inactive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
file LRU's head because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the pages under memory pressure
because the content of the page is *garbage*, so freeing such pages has
almost zero overhead: we don't need to swap out, and a later access causes
just a minor fault. Thus, it makes sense to put those freeable pages in the
inactive file LRU to compete with other used-once pages. It makes sense
from an implementation point of view, too, because it's not swapbacked
memory any longer until it is re-dirtied. It could even give a bonus by
letting them be reclaimed on a swapless system. However, MADV_COLD doesn't
mean garbage, so reclaiming them requires swap-out/in in the end, which is
a bigger cost. Since we have designed VM LRU aging based on a cost model,
anonymous cold pages are better positioned on the inactive anon LRU list,
not the file LRU. Furthermore, it helps to avoid unnecessary scanning if
the system doesn't have a swap device. Let's start the simpler way without
adding complexity at this moment. However, keep in mind the caveat
that workloads with a lot of page cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
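The man-page material in the commit message above can be exercised from userspace roughly as follows. This is a minimal sketch, not code from the patch: mark_cold_and_check() is a hypothetical helper, and the fallback value 20 for MADV_COLD is an assumption for libc headers that predate the hint.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: older libc headers may lack MADV_COLD; 20 is the uapi value. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, fill it, hint
 * that the range is cold, then verify the contents are still intact.
 *
 * Returns 0 if the hint was accepted, 1 if the kernel rejected the hint
 * (for example on a kernel that does not know MADV_COLD), and -1 if the
 * mapping failed or the data changed.
 */
int mark_cold_and_check(size_t len)
{
	unsigned char *buf;
	int ret = 0;
	size_t i;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0x5a, len);

	/* The hint is advisory; a failure here is not fatal to the caller. */
	if (madvise(buf, len, MADV_COLD) != 0)
		ret = 1;

	/* MADV_COLD is non-destructive, so every byte must still read back. */
	for (i = 0; i < len; i++) {
		if (buf[i] != 0x5a) {
			ret = -1;
			break;
		}
	}

	munmap(buf, len);
	return ret;
}
```

A caller would typically treat a return value of 1 as "hint unavailable" and carry on, since the data is preserved either way.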
2019-09-26 07:49:08 +08:00
|
|
|
|
2020-06-04 07:03:19 +08:00
|
|
|
__count_vm_events(PGDEACTIVATE, nr_pages);
|
|
|
|
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,
|
|
|
|
nr_pages);
|
2019-09-26 07:49:08 +08:00
|
|
|
}
|
|
|
|
}
|
2016-01-16 08:55:11 +08:00
|
|
|
|
2020-12-16 04:33:56 +08:00
|
|
|
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
|
2016-01-16 08:55:11 +08:00
|
|
|
{
|
2020-12-16 04:34:25 +08:00
|
|
|
if (PageAnon(page) && PageSwapBacked(page) &&
|
2017-10-04 07:15:29 +08:00
|
|
|
!PageSwapCache(page) && !PageUnevictable(page)) {
|
2020-08-15 08:30:37 +08:00
|
|
|
int nr_pages = thp_nr_pages(page);
|
2016-01-16 08:55:11 +08:00
|
|
|
|
2021-02-25 04:08:25 +08:00
|
|
|
del_page_from_lru_list(page, lruvec);
|
2016-01-16 08:55:11 +08:00
|
|
|
ClearPageActive(page);
|
|
|
|
ClearPageReferenced(page);
|
2017-05-04 05:52:29 +08:00
|
|
|
/*
|
2020-04-07 11:04:41 +08:00
|
|
|
* Lazyfree pages are clean anonymous pages. They have
|
|
|
|
* PG_swapbacked flag cleared, to distinguish them from normal
|
|
|
|
* anonymous pages
|
2017-05-04 05:52:29 +08:00
|
|
|
*/
|
|
|
|
ClearPageSwapBacked(page);
|
2021-02-25 04:08:17 +08:00
|
|
|
add_page_to_lru_list(page, lruvec);
|
2016-01-16 08:55:11 +08:00
|
|
|
|
2020-06-04 07:03:19 +08:00
|
|
|
__count_vm_events(PGLAZYFREE, nr_pages);
|
|
|
|
__count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE,
|
|
|
|
nr_pages);
|
2016-01-16 08:55:11 +08:00
|
|
|
}
|
|
|
|
}
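The flag checks in lru_lazyfree_fn() above can be modelled in userspace as a small state transition. This is only an illustrative sketch: the F_* bits and model_lazyfree() are hypothetical stand-ins for the kernel's page flags, and the LRU list movement and event counters are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the page flags tested in lru_lazyfree_fn(). */
enum model_flag {
	F_ANON        = 1u << 0,
	F_SWAPBACKED  = 1u << 1,
	F_SWAPCACHE   = 1u << 2,
	F_UNEVICTABLE = 1u << 3,
	F_ACTIVE      = 1u << 4,
	F_REFERENCED  = 1u << 5,
};

/*
 * Apply the lazyfree transition to a flag word.
 *
 * Only anonymous, swap-backed pages that are neither in the swap cache nor
 * unevictable qualify. For those, the active, referenced and swapbacked
 * bits are cleared, which is what marks the page as lazyfree.
 * Returns true if the transition was applied, false if flags are untouched.
 */
bool model_lazyfree(unsigned int *flags)
{
	unsigned int f = *flags;

	if (!(f & F_ANON) || !(f & F_SWAPBACKED) ||
	    (f & F_SWAPCACHE) || (f & F_UNEVICTABLE))
		return false;

	*flags = f & ~(F_ACTIVE | F_REFERENCED | F_SWAPBACKED);
	return true;
}
```

Because the swapbacked bit is cleared, a second call on the same flag word is a no-op in this model.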
|
|
|
|
|
mm: use pagevec to rotate reclaimable page
While running some memory intensive load, system response deteriorated just
after swap-out started.
The cause of this problem is that when a PG_reclaim page is moved to the tail
of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock
is acquired on every page writeback. This degrades system performance and
lengthens interrupt hold-off time once swap-out has started.
The following patch solves this problem. I use a pagevec when rotating
reclaimable pages to mitigate LRU spin lock contention and reduce interrupt
hold-off time.
I ran a test that allocates and touches pages in multiple processes while
flood-pinging the test machine to measure response under memory-intensive
load.
The test result is:
-2.6.23-rc5
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms
-2.6.23-rc5-patched
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
Max round-trip time was improved.
The test machine has 4 CPUs (3.16GHz, Hyper-threading enabled), 8GB memory,
and 8GB swap.
I ran the ping test again to observe any performance deterioration caused by
taking a ref.
-2.6.23-rc6-with-modifiedpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms
The result for my original patch is as follows.
-2.6.23-rc5-with-originalpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
The influence on response time was small.
[akpm@linux-foundation.org: fix uninitialised var warning]
[hugh@veritas.com: fix locking]
[randy.dunlap@oracle.com: fix function declaration]
[hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
[hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
[hugh@veritas.com: move_tail_pages into lru_add_drain]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
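The batching idea described in the commit message above can be sketched with a small userspace model that counts how often a lock would be taken. All names here (struct batch, struct model_lru, rotate_n()) are hypothetical, and BATCH_SIZE of 15 is an assumption meant to resemble the kernel's pagevec size; the real code operates on struct page and the LRU lock.

```c
#include <stddef.h>

#define BATCH_SIZE 15	/* assumption: roughly the kernel's pagevec size */

/* Hypothetical per-CPU batch, standing in for a pagevec. */
struct batch {
	size_t nr;
	int items[BATCH_SIZE];
};

/* Hypothetical LRU model that only counts lock round trips and moves. */
struct model_lru {
	unsigned long lock_acquisitions;
	unsigned long moved;
};

/* Drain: take the "lock" once and process every queued item. */
void batch_drain(struct batch *b, struct model_lru *lru)
{
	if (!b->nr)
		return;

	lru->lock_acquisitions++;	/* one lock/unlock pair per batch */
	lru->moved += b->nr;
	b->nr = 0;
}

/* Queue one item and drain only when the batch is full. */
void batch_add(struct batch *b, struct model_lru *lru, int item)
{
	b->items[b->nr++] = item;
	if (b->nr == BATCH_SIZE)
		batch_drain(b, lru);
}

/* Return how many lock acquisitions are needed to rotate n items. */
unsigned long rotate_n(unsigned long n)
{
	struct batch b = { 0 };
	struct model_lru lru = { 0 };
	unsigned long i;

	for (i = 0; i < n; i++)
		batch_add(&b, &lru, (int)i);
	batch_drain(&b, &lru);	/* flush the partial batch, if any */

	return lru.lock_acquisitions;
}
```

In this model, rotating 3000 items costs 200 lock acquisitions instead of 3000, which is the kind of contention reduction the patch is after.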
2007-10-16 16:24:52 +08:00
|
|
|
/*
|
|
|
|
* Drain pages out of the cpu's pagevecs.
|
|
|
|
* Either "cpu" is the current CPU, and preemption has already been
|
|
|
|
* disabled; or "cpu" is being hot-unplugged, and is already dead.
|
|
|
|
*/
|
2012-03-22 07:34:06 +08:00
|
|
|
void lru_add_drain_cpu(int cpu)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec = &per_cpu(lru_pvecs.lru_add, cpu);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-07-04 06:02:28 +08:00
|
|
|
if (pagevec_count(pvec))
|
2013-07-04 06:02:32 +08:00
|
|
|
__pagevec_lru_add(pvec);
|
2007-10-16 16:24:52 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
pvec = &per_cpu(lru_rotate.pvec, cpu);
|
2020-08-15 08:31:50 +08:00
|
|
|
/* Disabling interrupts below acts as a compiler barrier. */
|
|
|
|
if (data_race(pagevec_count(pvec))) {
|
2007-10-16 16:24:52 +08:00
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/* No harm done if a racing interrupt already did this */
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock_irqsave(&lru_rotate.lock, flags);
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock_irqrestore(&lru_rotate.lock, flags);
|
2007-10-16 16:24:52 +08:00
|
|
|
}
|
2011-03-23 07:32:52 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
|
2011-03-23 07:32:52 +08:00
|
|
|
if (pagevec_count(pvec))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
|
2011-05-25 08:12:55 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
|
2019-09-26 07:49:08 +08:00
|
|
|
if (pagevec_count(pvec))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_deactivate_fn);
|
mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process. Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it could give the platform many opportunities to use the information it
has to optimize memory efficiency.
To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD, which will deactivate activated pages, and the other is
MADV_PAGEOUT, which will reclaim private pages instantly. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in that it hints to the kernel that the memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in that it hints to the kernel that the memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to the kernel that the pages can be reclaimed when memory
pressure happens but that the data should be preserved for future use.
This could reduce workingset eviction and so end up increasing performance.
This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help the kernel decide which
pages to evict early during memory pressure.
It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves
active file page -> inactive file LRU
active anon page -> inactive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the
inactive file LRU, because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the page under memory pressure because
the content of the page is *garbage*, so freeing such pages has almost zero
overhead: we don't need to swap out, and a later access causes just a
minor fault. Thus, it makes sense to put those freeable pages in the
inactive file LRU to compete with other used-once pages. It makes sense
from an implementation point of view, too, because the memory is no longer
swap-backed until it is re-dirtied. It even gives a bonus in that such
pages can be reclaimed on a swapless system. However, MADV_COLD doesn't
mean garbage, so reclaiming those pages requires swap-out/in in the end,
which is a bigger cost. Since we have designed VM LRU aging based on a
cost model, anonymous cold pages are better positioned on the inactive
anon LRU list, not the file LRU. Furthermore, that helps to avoid
unnecessary scanning if the system doesn't have a swap device. Let's start
the simpler way without adding complexity at this moment. However, keep in
mind the caveat that workloads with a lot of page cache are likely to
ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU
lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
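The commit message above describes MADV_COLD as a non-destructive hint: the pages become reclaim candidates but their contents are preserved. A minimal userspace sketch of that contract follows. It is an illustration, not part of the patch: the helper name hint_cold_and_verify() is hypothetical, the fallback value 20 for MADV_COLD is an assumption taken from the generic Linux uapi headers for toolchains whose libc headers predate the flag, and a kernel without MADV_COLD support is expected to reject the hint (typically with EINVAL), which the sketch treats as a soft failure.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* assumed value from the generic Linux uapi headers */
#endif

/*
 * Hypothetical helper: map @npages of anonymous memory, dirty it, hint it
 * cold with madvise(2), then check that the contents are unchanged.
 *
 * Returns  1 if the kernel accepted the hint and the data is intact,
 *          0 if the kernel rejected the hint but the data is intact,
 *         -1 on a bad argument, a mapping failure, or changed data.
 */
int hint_cold_and_verify(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	size_t len, i;
	int accepted;
	int intact = 1;

	if (page_size <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)page_size;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Fault the pages in and dirty them so there is data to preserve. */
	memset(buf, 0xA5, len);

	/* The hint only deactivates pages; it must not discard them. */
	accepted = (madvise(buf, len, MADV_COLD) == 0);

	/* Reading the range back may fault pages in again, never zero them. */
	for (i = 0; i < len; i++) {
		if (buf[i] != 0xA5) {
			intact = 0;
			break;
		}
	}

	munmap(buf, len);

	if (!intact)
		return -1;
	return accepted ? 1 : 0;
}
```

A caller would invoke hint_cold_and_verify(4) from its own main(). A return value of 0 only says that the running kernel did not accept the hint, so the data check is the part that illustrates the "contents are preserved" wording of the man-page material.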
2019-09-26 07:49:08 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
|
2016-01-16 08:55:11 +08:00
|
|
|
if (pagevec_count(pvec))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
|
2016-01-16 08:55:11 +08:00
|
|
|
|
2011-05-25 08:12:55 +08:00
|
|
|
activate_page_drain(cpu);
|
2011-03-23 07:32:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2015-04-16 07:13:26 +08:00
|
|
|
* deactivate_file_page - forcefully deactivate a file page
|
2011-03-23 07:32:52 +08:00
|
|
|
* @page: page to deactivate
|
|
|
|
*
|
|
|
|
* This function hints the VM that @page is a good reclaim candidate,
|
|
|
|
* for example if its invalidation fails due to the page being dirty
|
|
|
|
* or under writeback.
|
|
|
|
*/
|
2015-04-16 07:13:26 +08:00
|
|
|
void deactivate_file_page(struct page *page)
|
2011-03-23 07:32:52 +08:00
|
|
|
{
|
2011-05-25 08:12:31 +08:00
|
|
|
/*
|
2015-04-16 07:13:26 +08:00
|
|
|
* In a workload with many unevictable pages such as mprotect,
|
|
|
|
* unevictable page deactivation for accelerating reclaim is pointless.
|
2011-05-25 08:12:31 +08:00
|
|
|
*/
|
|
|
|
if (PageUnevictable(page))
|
|
|
|
return;
|
|
|
|
|
2011-03-23 07:32:52 +08:00
|
|
|
if (likely(get_page_unless_zero(page))) {
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec;
|
|
|
|
|
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
|
2011-03-23 07:32:52 +08:00
|
|
|
|
mm: disable LRU pagevec during the migration temporarily
An LRU pagevec holds a refcount on its pages until the pagevec is drained.
That could prevent migration, since the refcount of the page is greater
than the migration logic expects. To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.
However, that's not enough, because pages coming into the pagevec after the
draining call could still stay in the pagevec and keep preventing page
migration. Since some callers of migrate_pages have retry logic with LRU
draining, the page would migrate at the next try, but it is still fragile
in that it doesn't close the fundamental race between pages newly entering
the pagevec and migration, so the migration failure could cause contiguous
memory allocation failure in the end.
To close the race, this patch disables LRU caches (i.e., pagevec) during
ongoing migration until the migration is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried in force mode (which is about a fallback to sync
migration) with the debug code below.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test repeatedly launched Android apps with CMA allocation in the
background every five seconds. The total CMA allocation count was about
500 during the testing. With this patch, the dump_page count was reduced
from 400 to 30.
The new interface is also useful for memory hotplug, which currently
drains LRU pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with the pcp allocator caches, which are disabled for offlining as
well.
Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:36:54 +08:00
|
|
|
if (pagevec_add_and_need_flush(pvec, page))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock(&lru_pvecs.lock);
|
2011-03-23 07:32:52 +08:00
|
|
|
}
|
2006-01-06 16:11:14 +08:00
|
|
|
}
|
|
|
|
|
2019-09-26 07:49:08 +08:00
|
|
|
/*
|
|
|
|
* deactivate_page - deactivate a page
|
|
|
|
* @page: page to deactivate
|
|
|
|
*
|
|
|
|
* deactivate_page() moves @page to the inactive list if @page was on the active
|
|
|
|
* list and was not an unevictable page. This is done to accelerate the reclaim
|
|
|
|
* of @page.
|
|
|
|
*/
|
|
|
|
void deactivate_page(struct page *page)
|
|
|
|
{
|
|
|
|
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec;
|
2019-09-26 07:49:08 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
|
2019-09-26 07:49:08 +08:00
|
|
|
get_page(page);
|
2021-05-05 09:36:54 +08:00
|
|
|
if (pagevec_add_and_need_flush(pvec, page))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_deactivate_fn);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock(&lru_pvecs.lock);
|
mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidates for swap. On investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed (we measured the performance of swapping out vs.
zapping the memory by killing a process; unsurprisingly, zapping is 10x
faster even though we use zram, which is much faster than real
storage), so a kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it gives the platform more opportunities to use the information it has to
optimize memory efficiency.
To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves
active file page -> inactive file LRU
active anon page -> inactive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive file
LRU's head because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the pages under memory pressure because
the content of the page is *garbage*, so freeing such pages has almost zero
overhead: we don't need to swap out, and a later access causes just a
minor fault. Thus, it makes sense to put those freeable pages on the
inactive file LRU to compete with other used-once pages. It makes sense from
an implementation point of view, too, because the memory is no longer
swap-backed until it is re-dirtied. It even gives a bonus: such pages can
be reclaimed on a swapless system. However, MADV_COLD doesn't mean
garbage, so reclaiming those pages requires swap-out/in in the end, which
costs more. Since we have designed VM LRU aging based on a cost model,
anonymous cold pages are better placed on the inactive anon LRU list, not the
file LRU. Furthermore, this helps avoid unnecessary scanning if the system
doesn't have a swap device. Let's start with the simpler way without adding
complexity at this moment. However, keep in mind the caveat
that workloads with a lot of page cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
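As an illustration of the man-page text above, here is a minimal userspace sketch of the hint. It is not part of the patch: the helper name mark_cold_and_verify() is hypothetical, and the fallback definition of MADV_COLD (20, the value in the Linux uapi headers) is there only for C libraries whose headers predate the flag. On kernels without MADV_COLD support the call is expected to fail, which the sketch reports rather than treats as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* Linux uapi value; older libc headers may lack it */
#endif

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, dirty it,
 * hint that it is cold, and check that the contents survived.
 * Returns 1 if the kernel accepted the hint, 0 if it rejected it,
 * and -1 on setup failure or if the data changed.
 */
static int mark_cold_and_verify(size_t len)
{
	unsigned char *buf;
	size_t i;
	int accepted;

	assert(len > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xA5, len);	/* fault the pages in and dirty them */

	/* Non-destructive hint: pages may be reclaimed sooner, data stays. */
	accepted = (madvise(buf, len, MADV_COLD) == 0);

	for (i = 0; i < len; i++) {
		if (buf[i] != 0xA5) {
			munmap(buf, len);
			return -1;
		}
	}

	munmap(buf, len);
	return accepted;
}
```

Calling mark_cold_and_verify(1 << 16) should never return -1: whether or not the hint is accepted, the contents of the region are preserved, which is the contrast with MADV_FREE that the text draws.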
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 07:49:08 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-01-16 08:55:11 +08:00
|
|
|
/**
|
2017-05-04 05:52:29 +08:00
|
|
|
* mark_page_lazyfree - make an anon page lazyfree
|
2016-01-16 08:55:11 +08:00
|
|
|
* @page: page to deactivate
|
|
|
|
*
|
2017-05-04 05:52:29 +08:00
|
|
|
* mark_page_lazyfree() moves @page to the inactive file list.
|
|
|
|
* This is done to accelerate the reclaim of @page.
|
2016-01-16 08:55:11 +08:00
|
|
|
*/
|
2017-05-04 05:52:29 +08:00
|
|
|
void mark_page_lazyfree(struct page *page)
|
2016-01-16 08:55:11 +08:00
|
|
|
{
|
2017-05-04 05:52:29 +08:00
|
|
|
if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
|
2017-10-04 07:15:29 +08:00
|
|
|
!PageSwapCache(page) && !PageUnevictable(page)) {
|
2020-05-28 04:11:15 +08:00
|
|
|
struct pagevec *pvec;
|
2016-01-16 08:55:11 +08:00
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion as to whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
get_page(page);
|
mm: disable LRU pagevec during the migration temporarily
The LRU pagevec holds a refcount on pages until the pagevec is drained. That
can prevent migration, since the refcount of the page is greater than
the migration logic expects. To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.
However, that's not enough, because pages entering the pagevec after the
draining call can still stay in the pagevec and keep
preventing page migration. Since some callers of migrate_pages have
retry logic with LRU draining, the page would migrate on the next trial,
but it is still fragile in that it doesn't close the fundamental race
between incoming LRU pages entering the pagevec and migration, so the
migration failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables the LRU caches (i.e., pagevec) during
an ongoing migration until the migration is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried in force mode (a fallback to sync
migration) with the debug code below.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test repeatedly launched Android apps with CMA allocation in the
background every five seconds. The total CMA allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with the pcp allocator cache, which is disabled during offlining as
well.
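The idea described above (drain once, then keep new pages from lingering in the per-CPU batch while migration is in progress) can be sketched as a small single-threaded userspace model. This is an illustration under stated assumptions, not the kernel implementation: the names cache_add(), cache_disable(), cache_enable(), and flush_batch() are hypothetical, BATCH_SIZE is an arbitrary capacity, and the model uses one batch instead of one per CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BATCH_SIZE 15	/* arbitrary batch capacity for this model */

static atomic_int disable_count;	/* models lru_disable_count */
static int batched;			/* pages parked in the single batch */
static int on_lru;			/* pages that reached the LRU list */

static bool cache_disabled(void)
{
	return atomic_load(&disable_count) != 0;
}

static void flush_batch(void)
{
	on_lru += batched;
	batched = 0;
}

/* Add one page; flush at once when the batch is full or caching is off. */
static void cache_add(void)
{
	batched++;
	if (batched == BATCH_SIZE || cache_disabled())
		flush_batch();
}

/* Turn batching off first, then drain whatever is already parked. */
static void cache_disable(void)
{
	atomic_fetch_add(&disable_count, 1);
	flush_batch();
}

/* Must be paired with an earlier cache_disable(). */
static void cache_enable(void)
{
	int old = atomic_fetch_sub(&disable_count, 1);

	assert(old > 0);
	(void)old;
}
```

Incrementing the counter before draining is what closes the race in this model: any page added after the increment is flushed immediately, so nothing can be left holding an extra reference in the batch once cache_disable() returns.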
Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:36:54 +08:00
|
|
|
if (pagevec_add_and_need_flush(pvec, page))
|
2020-12-16 04:33:56 +08:00
|
|
|
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
|
2020-05-28 04:11:15 +08:00
|
|
|
local_unlock(&lru_pvecs.lock);
|
2016-01-16 08:55:11 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-01-06 16:11:14 +08:00
|
|
|
void lru_add_drain(void)
|
|
|
|
{
|
2020-05-28 04:11:15 +08:00
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
lru_add_drain_cpu(smp_processor_id());
|
|
|
|
local_unlock(&lru_pvecs.lock);
|
|
|
|
}
|
|
|
|
|
2021-09-25 06:43:47 +08:00
|
|
|
/*
|
|
|
|
* It's called from per-cpu workqueue context in SMP case so
|
|
|
|
* lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
|
|
|
|
* the same cpu. It shouldn't be a problem in !SMP case since
|
|
|
|
* the core is only one and the locks will disable preemption.
|
|
|
|
*/
|
|
|
|
static void lru_add_and_bh_lrus_drain(void)
|
|
|
|
{
|
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
lru_add_drain_cpu(smp_processor_id());
|
|
|
|
local_unlock(&lru_pvecs.lock);
|
|
|
|
invalidate_bh_lrus_cpu();
|
|
|
|
}
|
|
|
|
|
2020-05-28 04:11:15 +08:00
|
|
|
void lru_add_drain_cpu_zone(struct zone *zone)
|
|
|
|
{
|
|
|
|
local_lock(&lru_pvecs.lock);
|
|
|
|
lru_add_drain_cpu(smp_processor_id());
|
|
|
|
drain_local_pages(zone);
|
|
|
|
local_unlock(&lru_pvecs.lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2019-02-21 14:19:54 +08:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
|
|
|
static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
|
|
|
|
|
2006-11-22 22:57:56 +08:00
|
|
|
static void lru_add_drain_per_cpu(struct work_struct *dummy)
|
2006-01-19 09:42:27 +08:00
|
|
|
{
|
2021-09-25 06:43:47 +08:00
|
|
|
lru_add_and_bh_lrus_drain();
|
2006-01-19 09:42:27 +08:00
|
|
|
}
|
|
|
|
|
2018-02-01 08:16:19 +08:00
|
|
|
/*
|
|
|
|
* Doesn't need any cpu hotplug locking because we do rely on per-cpu
|
|
|
|
* kworkers being shut down before our page_alloc_cpu_dead callback is
|
|
|
|
* executed on the offlined cpu.
|
|
|
|
* Calling this function with cpu hotplug locks held can actually lead
|
|
|
|
* to obscure indirect dependencies via WQ context.
|
|
|
|
*/
|
2021-05-05 09:36:54 +08:00
|
|
|
inline void __lru_add_drain_all(bool force_all_cpus)
|
2006-01-19 09:42:27 +08:00
|
|
|
{
|
2020-08-27 19:40:38 +08:00
|
|
|
/*
|
|
|
|
* lru_drain_gen - Global pages generation number
|
|
|
|
*
|
|
|
|
* (A) Definition: global lru_drain_gen = x implies that all generations
|
|
|
|
* 0 < n <= x are already *scheduled* for draining.
|
|
|
|
*
|
|
|
|
* This is an optimization for the highly-contended use case where a
|
|
|
|
* user space workload keeps constantly generating a flow of pages for
|
|
|
|
* each CPU.
|
|
|
|
*/
|
|
|
|
static unsigned int lru_drain_gen;
|
2013-09-13 06:13:55 +08:00
|
|
|
static struct cpumask has_work;
|
2020-08-27 19:40:38 +08:00
|
|
|
static DEFINE_MUTEX(lock);
|
|
|
|
unsigned cpu, this_gen;
|
2013-09-13 06:13:55 +08:00
|
|
|
|
2017-04-08 07:05:05 +08:00
|
|
|
/*
|
|
|
|
* Make sure nobody triggers this path before mm_percpu_wq is fully
|
|
|
|
* initialized.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(!mm_percpu_wq))
|
|
|
|
return;
|
|
|
|
|
2020-08-27 19:40:38 +08:00
|
|
|
/*
|
|
|
|
* Guarantee pagevec counter stores visible by this CPU are visible to
|
|
|
|
* other CPUs before loading the current drain generation.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* (B) Locally cache global LRU draining generation number
|
|
|
|
*
|
|
|
|
* The read barrier ensures that the counter is loaded before the mutex
|
|
|
|
* is taken. It pairs with smp_mb() inside the mutex critical section
|
|
|
|
* at (D).
|
|
|
|
*/
|
|
|
|
this_gen = smp_load_acquire(&lru_drain_gen);
|
2019-12-01 09:50:40 +08:00
|
|
|
|
2013-09-13 06:13:55 +08:00
|
|
|
mutex_lock(&lock);
|
2019-12-01 09:50:40 +08:00
|
|
|
|
|
|
|
/*
|
2020-08-27 19:40:38 +08:00
|
|
|
* (C) Exit the draining operation if a newer generation, from another
|
|
|
|
* lru_add_drain_all(), was already scheduled for draining. Check (A).
|
2019-12-01 09:50:40 +08:00
|
|
|
*/
|
2021-05-05 09:36:54 +08:00
|
|
|
if (unlikely(this_gen != lru_drain_gen && !force_all_cpus))
|
2019-12-01 09:50:40 +08:00
|
|
|
goto done;
|
|
|
|
|
2020-08-27 19:40:38 +08:00
|
|
|
/*
|
|
|
|
* (D) Increment global generation number
|
|
|
|
*
|
|
|
|
* Pairs with smp_load_acquire() at (B), outside of the critical
|
|
|
|
* section. Use a full memory barrier to guarantee that the new global
|
|
|
|
* drain generation number is stored before loading pagevec counters.
|
|
|
|
*
|
|
|
|
* This pairing must be done here, before the for_each_online_cpu loop
|
|
|
|
* below which drains the page vectors.
|
|
|
|
*
|
|
|
|
* Let x, y, and z represent some system CPU numbers, where x < y < z.
|
2021-05-07 09:05:51 +08:00
|
|
|
* Assume CPU #z is in the middle of the for_each_online_cpu loop
|
2020-08-27 19:40:38 +08:00
|
|
|
* below and has already reached CPU #y's per-cpu data. CPU #x comes
|
|
|
|
* along, adds some pages to its per-cpu vectors, then calls
|
|
|
|
* lru_add_drain_all().
|
|
|
|
*
|
|
|
|
* If the paired barrier is done at any later step, e.g. after the
|
|
|
|
* loop, CPU #x will just exit at (C) and miss flushing out all of its
|
|
|
|
* added pages.
|
|
|
|
*/
|
|
|
|
WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
|
|
|
|
smp_mb();
|
2019-12-01 09:50:40 +08:00
|
|
|
|
2013-09-13 06:13:55 +08:00
|
|
|
cpumask_clear(&has_work);
|
|
|
|
for_each_online_cpu(cpu) {
|
|
|
|
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
|
|
|
|
|
2021-05-05 09:36:54 +08:00
|
|
|
if (force_all_cpus ||
|
|
|
|
pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
|
2020-08-15 08:31:50 +08:00
|
|
|
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
|
2020-05-28 04:11:15 +08:00
|
|
|
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
|
|
|
|
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
|
|
|
|
pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
|
2021-05-05 09:37:00 +08:00
|
|
|
need_activate_page_drain(cpu) ||
|
|
|
|
has_bh_in_lru(cpu, NULL)) {
|
2013-09-13 06:13:55 +08:00
|
|
|
INIT_WORK(work, lru_add_drain_per_cpu);
|
2017-04-08 07:05:05 +08:00
|
|
|
queue_work_on(cpu, mm_percpu_wq, work);
|
2020-08-27 19:40:38 +08:00
|
|
|
__cpumask_set_cpu(cpu, &has_work);
|
2013-09-13 06:13:55 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for_each_cpu(cpu, &has_work)
|
|
|
|
flush_work(&per_cpu(lru_add_drain_work, cpu));
|
|
|
|
|
2019-12-01 09:50:40 +08:00
|
|
|
done:
|
2013-09-13 06:13:55 +08:00
|
|
|
mutex_unlock(&lock);
|
2006-01-19 09:42:27 +08:00
|
|
|
}
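The generation-number scheme in steps (A) to (D) above can be modelled in userspace with C11 atomics and a pthread mutex. The sketch below is only an approximation for illustration: snapshot_gen() and drain_with_gen() are hypothetical names that split the function in two so the early exit at (C) can be exercised deterministically, pending[] stands in for the per-CPU pagevec counts, and the per-CPU work items are replaced by clearing the counter directly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4	/* arbitrary CPU count for this model */

static atomic_uint drain_gen;		/* models lru_drain_gen */
static atomic_int pending[NR_CPUS];	/* models per-CPU pagevec counts */
static pthread_mutex_t drain_lock = PTHREAD_MUTEX_INITIALIZER;

/* Step (B): publish earlier stores, then sample the generation number. */
static unsigned int snapshot_gen(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&drain_gen, memory_order_acquire);
}

/*
 * Steps (C) and (D). Returns -1 if the drain was skipped because a newer
 * generation was already scheduled, else the number of CPUs drained.
 */
static int drain_with_gen(unsigned int this_gen, bool force_all_cpus)
{
	int cpu, drained = -1;

	pthread_mutex_lock(&drain_lock);

	/* (C) Someone else already scheduled a drain that covers us. */
	if (this_gen != atomic_load(&drain_gen) && !force_all_cpus)
		goto done;

	/* (D) Bump the generation before looking at any per-CPU counter. */
	atomic_fetch_add(&drain_gen, 1);
	atomic_thread_fence(memory_order_seq_cst);

	drained = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (force_all_cpus || atomic_load(&pending[cpu])) {
			atomic_store(&pending[cpu], 0);	/* "drain" this CPU */
			drained++;
		}
	}
done:
	pthread_mutex_unlock(&drain_lock);
	return drained;
}
```

A caller whose snapshot is stale skips the work unless it passes force_all_cpus, which corresponds to the lru_cache_disable() path, where every CPU is drained regardless of the generation check.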
|
2021-05-05 09:36:54 +08:00
|
|
|
|
|
|
|
void lru_add_drain_all(void)
|
|
|
|
{
|
|
|
|
__lru_add_drain_all(false);
|
|
|
|
}
|
2019-02-21 14:19:54 +08:00
|
|
|
#else
|
|
|
|
void lru_add_drain_all(void)
|
|
|
|
{
|
|
|
|
lru_add_drain();
|
|
|
|
}
|
2020-08-27 19:40:38 +08:00
|
|
|
#endif /* CONFIG_SMP */
|
2006-01-19 09:42:27 +08:00
|
|
|
|
2021-05-05 09:36:54 +08:00
|
|
|
atomic_t lru_disable_count = ATOMIC_INIT(0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* lru_cache_disable() needs to be called before we start compiling
|
|
|
|
* a list of pages to be migrated using isolate_lru_page().
|
|
|
|
 * It drains pages from the LRU cache and then disables it on all CPUs until
|
|
|
|
* lru_cache_enable is called.
|
|
|
|
*
|
|
|
|
* Must be paired with a call to lru_cache_enable().
|
|
|
|
*/
|
|
|
|
void lru_cache_disable(void)
|
|
|
|
{
|
|
|
|
atomic_inc(&lru_disable_count);
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
/*
|
|
|
|
* lru_add_drain_all in the force mode will schedule draining on
|
|
|
|
* all online CPUs so any calls of lru_cache_disabled wrapped by
|
|
|
|
* local_lock or preemption disabled would be ordered by that.
|
|
|
|
* The atomic operation doesn't need to have stronger ordering
|
2022-01-15 06:09:25 +08:00
|
|
|
* requirements because that is enforced by the scheduling
|
mm: disable LRU pagevec during the migration temporarily
LRU pagevec holds a refcount on pages until the pagevec is drained. This
can prevent migration since the refcount of the page is greater than
the migration logic expects. To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.
However, that is not enough because pages coming into the pagevec after the
draining call can still stay in the pagevec and keep
preventing page migration. Since some callers of migrate_pages have
retry logic with LRU draining, the page would migrate at the next trial,
but this is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages entering the pagevec and migration, so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables LRU caches (i.e., pagevec) during
ongoing migration until migration is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (it is about a fallback to a sync
migration) with the debug code below.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR "pfn 0x%lx reason %d", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test was repeating android apps launching with cma allocation in
background every five seconds. Total cma allocation count was about 500
during the testing. With this patch, the dump_page count was reduced
from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:36:54 +08:00
|
|
|
* guarantees.
|
|
|
|
*/
|
|
|
|
__lru_add_drain_all(true);
|
|
|
|
#else
|
2021-09-25 06:43:47 +08:00
|
|
|
lru_add_and_bh_lrus_drain();
|
2021-05-05 09:36:54 +08:00
|
|
|
#endif
|
|
|
|
}
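The pairing rule stated in the comment above lru_cache_disable() can be illustrated with a small userspace sketch. It is not the kernel implementation: every sketch_* name is hypothetical, C11 atomics stand in for atomic_t, and the drain step is omitted. It assumes only what the comment says, namely that the cache counts as disabled while at least one disable call has not yet been matched by an enable call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for lru_disable_count; starts at zero. */
static atomic_int sketch_disable_count = 0;

/* Models the counting half of lru_cache_disable(); the real function
 * also drains the per-CPU LRU caches after the increment. */
void sketch_lru_cache_disable(void)
{
	atomic_fetch_add(&sketch_disable_count, 1);
}

/* Models the paired enable call: one decrement per earlier disable. */
void sketch_lru_cache_enable(void)
{
	atomic_fetch_sub(&sketch_disable_count, 1);
}

/* The cache is treated as disabled while any disable is outstanding. */
bool sketch_lru_cache_disabled(void)
{
	return atomic_load(&sketch_disable_count) != 0;
}
```

Because the state is a counter rather than a flag, nested or concurrent callers each hold their own disable, and the cache becomes usable again only after the last matching enable.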
|
|
|
|
|
2014-10-10 06:28:52 +08:00
|
|
|
/**
|
2016-04-01 20:29:48 +08:00
|
|
|
* release_pages - batched put_page()
|
2014-10-10 06:28:52 +08:00
|
|
|
* @pages: array of pages to release
|
|
|
|
* @nr: number of pages
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2014-10-10 06:28:52 +08:00
|
|
|
* Decrement the reference count on all the pages in @pages. If it
|
|
|
|
* fell to zero, remove the page from the LRU and free it.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-11-16 09:37:55 +08:00
|
|
|
void release_pages(struct page **pages, int nr)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int i;
|
2012-01-11 07:07:04 +08:00
|
|
|
LIST_HEAD(pages_to_free);
|
2020-12-16 04:34:29 +08:00
|
|
|
struct lruvec *lruvec = NULL;
|
2021-06-30 10:27:31 +08:00
|
|
|
unsigned long flags = 0;
|
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-04 04:09:38 +08:00
|
|
|
unsigned int lock_batch;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
for (i = 0; i < nr; i++) {
|
|
|
|
struct page *page = pages[i];
|
2021-06-30 10:27:31 +08:00
|
|
|
struct folio *folio = page_folio(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2014-10-10 06:28:52 +08:00
|
|
|
/*
|
|
|
|
* Make sure the IRQ-safe lock-holding time does not get
|
|
|
|
* excessive with a continuous string of pages from the
|
2020-12-16 04:34:29 +08:00
|
|
|
* same lruvec. The lock is held only if lruvec != NULL.
|
2014-10-10 06:28:52 +08:00
|
|
|
*/
|
2020-12-16 04:34:29 +08:00
|
|
|
if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
|
|
|
|
unlock_page_lruvec_irqrestore(lruvec, flags);
|
|
|
|
lruvec = NULL;
|
2014-10-10 06:28:52 +08:00
|
|
|
}
|
|
|
|
|
2021-06-30 10:27:31 +08:00
|
|
|
page = &folio->page;
|
2016-10-08 08:00:08 +08:00
|
|
|
if (is_huge_zero_page(page))
|
2016-04-29 07:18:27 +08:00
|
|
|
continue;
|
|
|
|
|
2019-06-06 05:49:22 +08:00
|
|
|
if (is_zone_device_page(page)) {
|
2020-12-16 04:34:29 +08:00
|
|
|
if (lruvec) {
|
|
|
|
unlock_page_lruvec_irqrestore(lruvec, flags);
|
|
|
|
lruvec = NULL;
|
2017-09-09 07:12:24 +08:00
|
|
|
}
|
2019-06-06 05:49:22 +08:00
|
|
|
/*
|
|
|
|
* ZONE_DEVICE pages that return 'false' from
|
2020-10-14 07:52:15 +08:00
|
|
|
* page_is_devmap_managed() do not require special
|
2019-06-06 05:49:22 +08:00
|
|
|
* processing, and instead, expect a call to
|
|
|
|
* put_page_testzero().
|
|
|
|
*/
|
2020-01-31 14:12:28 +08:00
|
|
|
if (page_is_devmap_managed(page)) {
|
|
|
|
put_devmap_managed_page(page);
|
2019-06-06 05:49:22 +08:00
|
|
|
continue;
|
2020-01-31 14:12:28 +08:00
|
|
|
}
|
2020-12-15 11:05:55 +08:00
|
|
|
if (put_page_testzero(page))
|
|
|
|
put_dev_pagemap(page->pgmap);
|
|
|
|
continue;
|
2017-09-09 07:12:24 +08:00
|
|
|
}
|
|
|
|
|
2005-10-30 09:16:12 +08:00
|
|
|
if (!put_page_testzero(page))
|
2005-04-17 06:20:36 +08:00
|
|
|
continue;
|
|
|
|
|
2016-01-16 08:52:56 +08:00
|
|
|
if (PageCompound(page)) {
|
2020-12-16 04:34:29 +08:00
|
|
|
if (lruvec) {
|
|
|
|
unlock_page_lruvec_irqrestore(lruvec, flags);
|
|
|
|
lruvec = NULL;
|
2016-01-16 08:52:56 +08:00
|
|
|
}
|
|
|
|
__put_compound_page(page);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2006-03-22 16:07:58 +08:00
|
|
|
if (PageLRU(page)) {
|
2020-12-16 04:34:33 +08:00
|
|
|
struct lruvec *prev_lruvec = lruvec;
|
|
|
|
|
2021-06-30 10:27:31 +08:00
|
|
|
lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
|
2020-12-16 04:34:33 +08:00
|
|
|
&flags);
|
|
|
|
if (prev_lruvec != lruvec)
|
2014-10-10 06:28:52 +08:00
|
|
|
lock_batch = 0;
|
2012-05-30 06:07:09 +08:00
|
|
|
|
2021-02-25 04:08:25 +08:00
|
|
|
del_page_from_lru_list(page, lruvec);
|
2021-02-25 04:08:28 +08:00
|
|
|
__clear_page_lru_flags(page);
|
2006-03-22 16:07:58 +08:00
|
|
|
}
|
|
|
|
|
2022-02-15 10:28:05 +08:00
|
|
|
/*
|
|
|
|
* In rare cases, when truncation or holepunching raced with
|
|
|
|
* munlock after VM_LOCKED was cleared, Mlocked may still be
|
|
|
|
* found set here. This does not indicate a problem, unless
|
|
|
|
* "unevictable_pgs_cleared" appears worryingly large.
|
|
|
|
*/
|
|
|
|
if (unlikely(PageMlocked(page))) {
|
|
|
|
__ClearPageMlocked(page);
|
|
|
|
dec_zone_page_state(page, NR_MLOCK);
|
|
|
|
count_vm_event(UNEVICTABLE_PGCLEARED);
|
|
|
|
}
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
__ClearPageWaiters(page);
|
2013-07-04 06:02:34 +08:00
|
|
|
|
2012-01-11 07:07:04 +08:00
|
|
|
list_add(&page->lru, &pages_to_free);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2020-12-16 04:34:29 +08:00
|
|
|
if (lruvec)
|
|
|
|
unlock_page_lruvec_irqrestore(lruvec, flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2014-08-09 05:19:24 +08:00
|
|
|
mem_cgroup_uncharge_list(&pages_to_free);
|
2017-11-16 09:37:59 +08:00
|
|
|
free_unref_page_list(&pages_to_free);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2010-10-28 06:34:46 +08:00
|
|
|
EXPORT_SYMBOL(release_pages);
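The lock batching in release_pages() can be modelled in userspace as follows. This is a simplified sketch under stated assumptions, not the kernel code: sketch_count_lock_acquisitions() is a hypothetical helper, each page is reduced to a non-negative integer naming its lruvec, every page is treated as being on the LRU, and batch_max plays the role of SWAP_CLUSTER_MAX. The function only counts how often the lock would be taken.

```c
#include <assert.h>

/*
 * Count lock acquisitions for a run of pages, where group[i] >= 0
 * identifies the lruvec of page i. Mirrors the loop structure of
 * release_pages(): drop the lock after batch_max consecutive pages,
 * and relock whenever the lruvec changes.
 */
int sketch_count_lock_acquisitions(const int *group, int n,
				   unsigned int batch_max)
{
	int held = -1;			/* -1 means no lock is held */
	unsigned int lock_batch = 0;
	int acquisitions = 0;

	for (int i = 0; i < n; i++) {
		/* Bound the lock hold time for long runs of one lruvec. */
		if (held != -1 && ++lock_batch == batch_max)
			held = -1;	/* unlock */

		/* Relock if nothing is held or the lruvec differs. */
		if (held != group[i]) {
			held = group[i];
			lock_batch = 0;
			acquisitions++;
		}
	}
	return acquisitions;
}
```

In this model, ten pages from one lruvec with batch_max of 4 take the lock three times (pages 0-3, 4-7, 8-9), while pages that alternate between two lruvecs take it once per page.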
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The pages which we're about to release may be in the deferred lru-addition
|
|
|
|
* queues. That would prevent them from really being freed right now. That's
|
|
|
|
* OK from a correctness point of view but is inefficient - those pages may be
|
|
|
|
* cache-warm and we want to give them back to the page allocator ASAP.
|
|
|
|
*
|
|
|
|
* So __pagevec_release() will drain those queues here. __pagevec_lru_add()
|
|
|
|
* and __pagevec_lru_add_active() call release_pages() directly to avoid
|
|
|
|
* mutual recursion.
|
|
|
|
*/
|
|
|
|
void __pagevec_release(struct pagevec *pvec)
|
|
|
|
{
|
2017-11-16 09:38:10 +08:00
|
|
|
if (!pvec->percpu_pvec_drained) {
|
2017-11-16 09:37:48 +08:00
|
|
|
lru_add_drain();
|
2017-11-16 09:38:10 +08:00
|
|
|
pvec->percpu_pvec_drained = true;
|
2017-11-16 09:37:48 +08:00
|
|
|
}
|
2017-11-16 09:37:55 +08:00
|
|
|
release_pages(pvec->pages, pagevec_count(pvec));
|
2005-04-17 06:20:36 +08:00
|
|
|
pagevec_reinit(pvec);
|
|
|
|
}
|
2005-11-02 02:22:55 +08:00
|
|
|
EXPORT_SYMBOL(__pagevec_release);
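The drain-once behaviour of __pagevec_release() can be sketched in userspace like this. The names struct sketch_pvec, sketch_lru_add_drain() and sketch_drain_calls are hypothetical; the stub drain only counts its calls, and resetting nr stands in for release_pages() plus pagevec_reinit().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct pagevec. */
struct sketch_pvec {
	bool percpu_pvec_drained;
	unsigned int nr;
};

/* Counts how often the (stubbed) per-CPU drain ran. */
int sketch_drain_calls;

void sketch_lru_add_drain(void)
{
	sketch_drain_calls++;
}

/* Drain the per-CPU queues at most once per pagevec, then empty it. */
void sketch_pagevec_release(struct sketch_pvec *pvec)
{
	if (!pvec->percpu_pvec_drained) {
		sketch_lru_add_drain();
		pvec->percpu_pvec_drained = true;
	}
	pvec->nr = 0;	/* models release_pages() + pagevec_reinit() */
}
```

Releasing the same pagevec repeatedly therefore pays for the drain only on the first call in this model.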
|
|
|
|
|
2021-05-15 03:08:29 +08:00
|
|
|
static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
|
2011-03-23 07:33:45 +08:00
|
|
|
{
|
2021-05-15 03:08:29 +08:00
|
|
|
int was_unevictable = folio_test_clear_unevictable(folio);
|
|
|
|
long nr_pages = folio_nr_pages(folio);
|
2011-03-23 07:33:45 +08:00
|
|
|
|
2021-05-15 03:08:29 +08:00
|
|
|
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
|
2011-03-23 07:33:45 +08:00
|
|
|
|
2022-02-15 10:34:46 +08:00
|
|
|
folio_set_lru(folio);
|
mm, mlock, vmscan: no more skipping pagevecs
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU; the mlock
syscall will then set the PageMlocked() bit of the page but will not be
able to put that page on the unevictable LRU as the page is on the pagevec
of a different CPU. Even on drain, that page will go to the evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.
The page will eventually go to the right LRU on reclaim but the LRU stats
will remain skewed for a long time.
This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability. This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU. Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.
However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().
#0: __pagevec_lru_add_fn            #1: clear_page_mlock
SetPageLRU()                        if (!TestClearPageMlocked())
                                      return
smp_mb() // <--required
                                    // inside does PageLRU
if (!PageMlocked())                 if (isolate_lru_page())
  move to evictable LRU               putback_lru_page()
else
  move to unevictable LRU
In '#1', TestClearPageMlocked() provides full memory barrier semantics
and thus the PageLRU check (inside isolate_lru_page) can not be
reordered before it.
In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU(). If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page. That page will be stranded on the unevictable
LRU.
There is one (good) side effect though. Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment. This patch will correctly
put such pages to unevictable LRU.
Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22 06:45:28 +08:00
|
|
|
/*
|
2022-02-15 10:34:46 +08:00
|
|
|
* Is an smp_mb__after_atomic() still required here, before
|
|
|
|
* folio_evictable() tests PageMlocked, to rule out the possibility
|
|
|
|
* of stranding an evictable folio on an unevictable LRU? I think
|
|
|
|
* not, because munlock_page() only clears PageMlocked while the LRU
|
|
|
|
* lock is held.
|
2018-02-22 06:45:28 +08:00
|
|
|
*
|
2022-02-15 10:34:46 +08:00
|
|
|
* (That is not true of __page_cache_release(), and not necessarily
|
|
|
|
* true of release_pages(): but those only clear PageMlocked after
|
|
|
|
* put_page_testzero() has excluded any other users of the page.)
|
2018-02-22 06:45:28 +08:00
|
|
|
*/
|
2021-05-15 03:08:29 +08:00
|
|
|
if (folio_evictable(folio)) {
|
2018-02-22 06:45:28 +08:00
|
|
|
if (was_unevictable)
|
2020-06-04 07:03:16 +08:00
|
|
|
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
|
2018-02-22 06:45:28 +08:00
|
|
|
} else {
|
2021-05-15 03:08:29 +08:00
|
|
|
folio_clear_active(folio);
|
|
|
|
folio_set_unevictable(folio);
|
2022-02-15 10:29:54 +08:00
|
|
|
folio->mlock_count = !!folio_test_mlocked(folio);
|
mm, mlock, vmscan: no more skipping pagevecs
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec
of a different CPU. Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.
The page will eventually go to right LRU on reclaim but the LRU stats
will remain skewed for a long time.
This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability. This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU. Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.
However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().
#0: __pagevec_lru_add_fn #1: clear_page_mlock
SetPageLRU() if (!TestClearPageMlocked())
return
smp_mb() // <--required
// inside does PageLRU
if (!PageMlocked()) if (isolate_lru_page())
move to evictable LRU putback_lru_page()
else
move to unevictable LRU
In '#1', TestClearPageMlocked() provides full memory barrier semantics
and thus the PageLRU check (inside isolate_lru_page) can not be
reordered before it.
In '#0', without an explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU(). If that happens, '#0' can put a page on the
unevictable LRU while '#1' has just cleared the Mlocked bit of that
page but fails to isolate it, because the PageLRU check fails: '#0' has not
yet set the PageLRU bit of that page. That page will be stranded on the
unevictable LRU.
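The ordering argument above can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: `toy_page`, `toy_lru_add()` and `toy_clear_mlock_missed()` are hypothetical names, a sequentially consistent fence stands in for smp_mb(), a sequentially consistent exchange stands in for TestClearPageMlocked(), and the LRU lock and the lists themselves are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two page flags in the race above. */
struct toy_page {
	atomic_bool lru;     /* models PageLRU */
	atomic_bool mlocked; /* models PageMlocked */
};

/*
 * Side #0, modelled on __pagevec_lru_add_fn(): publish the LRU bit, then
 * test the Mlocked bit. Returns true if the page would be placed on the
 * unevictable list.
 */
static bool toy_lru_add(struct toy_page *p)
{
	assert(p);
	atomic_store_explicit(&p->lru, true, memory_order_relaxed);
	/* Plays the role of smp_mb(): the store above is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&p->mlocked, memory_order_relaxed);
}

/*
 * Side #1, modelled on clear_page_mlock(): clear the Mlocked bit, then
 * test the LRU bit. Returns true only if it cleared Mlocked but did not
 * see the page on an LRU, that is, if isolation would have failed.
 */
static bool toy_clear_mlock_missed(struct toy_page *p)
{
	assert(p);
	/* A seq_cst read-modify-write orders the following load after it. */
	if (!atomic_exchange(&p->mlocked, false))
		return false; /* page was not mlocked, nothing to do */
	return !atomic_load(&p->lru);
}
```

A page is stranded only if both functions return true for the same page. With the fence in place, at least one side observes the other's store, so that combination cannot occur; with the fence removed, the model no longer rules it out.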
There is one (good) side effect though. Without this patch, the pages
allocated for a System V shared memory segment are added to the evictable
LRUs even after shmctl(SHM_LOCK) on that segment. This patch correctly
puts such pages on the unevictable LRU.
Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22 06:45:28 +08:00
|
|
|
if (!was_unevictable)
|
2020-06-04 07:03:16 +08:00
|
|
|
__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
|
2018-02-22 06:45:28 +08:00
|
|
|
}
|
|
|
|
|
2021-05-15 03:08:29 +08:00
|
|
|
lruvec_add_folio(lruvec, folio);
|
|
|
|
trace_mm_lru_insertion(folio);
|
2011-03-23 07:33:45 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Add the passed pages to the LRU, then drop the caller's refcount
|
|
|
|
* on them. Reinitialises the caller's pagevec.
|
|
|
|
*/
|
2013-07-04 06:02:32 +08:00
|
|
|
void __pagevec_lru_add(struct pagevec *pvec)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-12-16 04:34:25 +08:00
|
|
|
int i;
|
2020-12-16 04:34:29 +08:00
|
|
|
struct lruvec *lruvec = NULL;
|
2020-12-16 04:34:25 +08:00
|
|
|
unsigned long flags = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < pagevec_count(pvec); i++) {
|
2021-05-15 03:08:29 +08:00
|
|
|
struct folio *folio = page_folio(pvec->pages[i]);
|
2020-12-16 04:34:25 +08:00
|
|
|
|
2021-06-30 10:27:31 +08:00
|
|
|
lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
|
2021-05-15 03:08:29 +08:00
|
|
|
__pagevec_lru_add_fn(folio, lruvec);
|
2020-12-16 04:34:25 +08:00
|
|
|
}
|
2020-12-16 04:34:29 +08:00
|
|
|
if (lruvec)
|
|
|
|
unlock_page_lruvec_irqrestore(lruvec, flags);
|
2020-12-16 04:34:25 +08:00
|
|
|
release_pages(pvec->pages, pvec->nr);
|
|
|
|
pagevec_reinit(pvec);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
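The loop in __pagevec_lru_add() keeps the current lruvec locked across iterations and only switches locks when the next folio belongs to a different lruvec. A minimal userspace sketch of that batching pattern follows; the `toy_*` types and counters are hypothetical, and plain integers stand in for the real spinlock and IRQ state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lruvec: a "lock" plus bookkeeping counters. */
struct toy_lruvec {
	int locked;     /* 1 while held */
	int lock_count; /* how many times the lock was taken */
	int nr_added;   /* how many items were added under the lock */
};

struct toy_item {
	struct toy_lruvec *lruvec; /* the list this item belongs to */
};

/* Keep the current lock if it already covers @item, otherwise switch locks. */
static struct toy_lruvec *toy_relock(struct toy_item *item,
				     struct toy_lruvec *locked)
{
	if (locked == item->lruvec)
		return locked;
	if (locked) {
		assert(locked->locked);
		locked->locked = 0;
	}
	item->lruvec->locked = 1;
	item->lruvec->lock_count++;
	return item->lruvec;
}

/* Add every item to its list, taking each lock once per run of equal lruvecs. */
static void toy_add_batch(struct toy_item *items, size_t n)
{
	struct toy_lruvec *locked = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		locked = toy_relock(&items[i], locked);
		assert(locked->locked);
		locked->nr_added++;
	}
	if (locked)
		locked->locked = 0;
}
```

For a batch whose items map to lruvecs A, A, B, B, A, the sketch takes A's lock twice and B's lock once, rather than once per item.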
|
|
|
|
|
2014-04-04 05:47:46 +08:00
|
|
|
/**
|
2021-12-08 03:28:49 +08:00
|
|
|
* folio_batch_remove_exceptionals() - Prune non-folios from a batch.
|
|
|
|
* @fbatch: The batch to prune
|
2014-04-04 05:47:46 +08:00
|
|
|
*
|
2021-12-08 03:28:49 +08:00
|
|
|
* find_get_entries() fills a batch with both folios and shadow/swap/DAX
|
|
|
|
* entries. This function prunes all the non-folio entries from @fbatch
|
|
|
|
* without leaving holes, so that it can be passed on to folio-only batch
|
|
|
|
* operations.
|
2014-04-04 05:47:46 +08:00
|
|
|
*/
|
2021-12-08 03:28:49 +08:00
|
|
|
void folio_batch_remove_exceptionals(struct folio_batch *fbatch)
|
2014-04-04 05:47:46 +08:00
|
|
|
{
|
2021-12-08 03:28:49 +08:00
|
|
|
unsigned int i, j;
|
2014-04-04 05:47:46 +08:00
|
|
|
|
2021-12-08 03:28:49 +08:00
|
|
|
for (i = 0, j = 0; i < folio_batch_count(fbatch); i++) {
|
|
|
|
struct folio *folio = fbatch->folios[i];
|
|
|
|
if (!xa_is_value(folio))
|
|
|
|
fbatch->folios[j++] = folio;
|
2014-04-04 05:47:46 +08:00
|
|
|
}
|
2021-12-08 03:28:49 +08:00
|
|
|
fbatch->nr = j;
|
2014-04-04 05:47:46 +08:00
|
|
|
}
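The two-index loop in folio_batch_remove_exceptionals() is an in-place compaction: `i` scans every slot and `j` marks where the next kept entry goes. The sketch below reproduces the pattern in userspace. `toy_batch`, `toy_is_value()`, `toy_mk_value()` and the capacity of 15 are assumptions made for illustration; the sketch assumes value entries are distinguished by bit 0 being set, which is what xa_is_value() tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BATCH_SIZE 15 /* assumed capacity, for illustration only */

struct toy_batch {
	unsigned int nr;
	void *slots[TOY_BATCH_SIZE];
};

/* Assumption: a value entry has bit 0 set; real pointers are aligned, so bit 0 is clear. */
static bool toy_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Encode an integer as a tagged value entry. */
static void *toy_mk_value(uintptr_t v)
{
	return (void *)((v << 1) | 1);
}

/* Drop all value entries, keeping the remaining pointers in their original order. */
static void toy_batch_remove_values(struct toy_batch *b)
{
	unsigned int i, j;

	assert(b->nr <= TOY_BATCH_SIZE);
	for (i = 0, j = 0; i < b->nr; i++) {
		void *entry = b->slots[i];

		if (!toy_is_value(entry))
			b->slots[j++] = entry; /* j <= i, so nothing unread is overwritten */
	}
	b->nr = j;
}
```

Because `j` never runs ahead of `i`, the compaction needs no scratch array and leaves no holes.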
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2017-09-07 07:21:21 +08:00
|
|
|
* pagevec_lookup_range - gang pagecache lookup
|
2005-04-17 06:20:36 +08:00
|
|
|
* @pvec: Where the resulting pages are placed
|
|
|
|
* @mapping: The address_space to search
|
|
|
|
* @start: The starting page index
|
2017-09-07 07:21:21 +08:00
|
|
|
* @end: The final page index
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2018-02-01 08:21:19 +08:00
|
|
|
* pagevec_lookup_range() will search for & return a group of up to PAGEVEC_SIZE
|
2017-09-07 07:21:21 +08:00
|
|
|
* pages in the mapping starting from index @start and up to index @end
|
|
|
|
* (inclusive). The pages are placed in @pvec. pagevec_lookup_range() takes a
|
2005-04-17 06:20:36 +08:00
|
|
|
* reference against the pages in @pvec.
|
|
|
|
*
|
|
|
|
* The search returns a group of mapping-contiguous pages with ascending
|
2017-09-07 07:21:18 +08:00
|
|
|
* indexes. There may be holes in the indices due to not-present pages. We
|
|
|
|
* also update @start to index the next page for the traversal.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2017-09-07 07:21:21 +08:00
|
|
|
* pagevec_lookup_range() returns the number of pages which were found. If this
|
2018-02-01 08:21:19 +08:00
|
|
|
* number is smaller than PAGEVEC_SIZE, the end of the specified range has been
|
2017-09-07 07:21:21 +08:00
|
|
|
* reached.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-09-07 07:21:21 +08:00
|
|
|
unsigned pagevec_lookup_range(struct pagevec *pvec,
|
2017-09-07 07:21:43 +08:00
|
|
|
struct address_space *mapping, pgoff_t *start, pgoff_t end)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-09-07 07:21:43 +08:00
|
|
|
pvec->nr = find_get_pages_range(mapping, start, end, PAGEVEC_SIZE,
|
2017-09-07 07:21:21 +08:00
|
|
|
pvec->pages);
|
2005-04-17 06:20:36 +08:00
|
|
|
return pagevec_count(pvec);
|
|
|
|
}
|
2017-09-07 07:21:21 +08:00
|
|
|
EXPORT_SYMBOL(pagevec_lookup_range);
|
2006-01-11 17:47:41 +08:00
|
|
|
|
2017-11-16 09:34:33 +08:00
|
|
|
unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
|
|
|
|
struct address_space *mapping, pgoff_t *index, pgoff_t end,
|
2017-12-06 06:30:38 +08:00
|
|
|
xa_mark_t tag)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-11-16 09:34:33 +08:00
|
|
|
pvec->nr = find_get_pages_range_tag(mapping, index, end, tag,
|
2017-11-16 09:35:19 +08:00
|
|
|
PAGEVEC_SIZE, pvec->pages);
|
2005-04-17 06:20:36 +08:00
|
|
|
return pagevec_count(pvec);
|
|
|
|
}
|
2017-11-16 09:34:33 +08:00
|
|
|
EXPORT_SYMBOL(pagevec_lookup_range_tag);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Perform any setup for the swap system
|
|
|
|
*/
|
|
|
|
void __init swap_setup(void)
|
|
|
|
{
|
2018-12-28 16:34:29 +08:00
|
|
|
unsigned long megs = totalram_pages() >> (20 - PAGE_SHIFT);
|
2007-10-17 14:25:46 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Use a smaller cluster for small-memory machines */
|
|
|
|
if (megs < 16)
|
|
|
|
page_cluster = 2;
|
|
|
|
else
|
|
|
|
page_cluster = 3;
|
|
|
|
/*
|
|
|
|
* Right now other parts of the system mean that we
|
|
|
|
* _really_ don't want to cluster much more
|
|
|
|
*/
|
|
|
|
}
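The arithmetic in swap_setup() can be checked in isolation. Shifting a page count right by (20 - PAGE_SHIFT) converts pages to MiB, and page_cluster is a base-2 logarithm of a page count. The helper names below are hypothetical, and the page shift is passed as a parameter rather than taken from PAGE_SHIFT; this is a sketch, not kernel code.

```c
#include <assert.h>

/* Convert a page count to MiB: pages * 2^page_shift / 2^20. */
static unsigned long toy_pages_to_megs(unsigned long nr_pages,
				       unsigned int page_shift)
{
	assert(page_shift <= 20); /* otherwise the shift amount would be negative */
	return nr_pages >> (20 - page_shift);
}

/* Same threshold as swap_setup(): smaller cluster below 16 MiB. */
static int toy_page_cluster(unsigned long megs)
{
	return megs < 16 ? 2 : 3;
}

/* page_cluster is a log2 value: the cluster spans 2^page_cluster pages. */
static unsigned long toy_cluster_pages(int page_cluster)
{
	return 1UL << page_cluster;
}
```

With 4 KiB pages (a shift of 12), 4096 pages are 16 MiB, which selects page_cluster = 3, that is, clusters of 8 pages.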
|
2020-01-31 14:12:28 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_DEV_PAGEMAP_OPS
|
|
|
|
void put_devmap_managed_page(struct page *page)
|
|
|
|
{
|
|
|
|
int count;
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
count = page_ref_dec_return(page);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* devmap page refcounts are 1-based, rather than 0-based: if
|
|
|
|
* refcount is 1, then the page is free and the refcount is
|
|
|
|
* stable because nobody holds a reference on the page.
|
|
|
|
*/
|
|
|
|
if (count == 1)
|
|
|
|
free_devmap_managed_page(page);
|
|
|
|
else if (!count)
|
|
|
|
__put_page(page);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(put_devmap_managed_page);
|
|
|
|
#endif
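The 1-based refcount convention in put_devmap_managed_page() can be illustrated with a small userspace model. `toy_devpage`, `toy_put_action` and `toy_put_devmap_page()` are hypothetical names; an atomic decrement stands in for page_ref_dec_return(), and the function only reports which path would be taken instead of freeing anything.

```c
#include <assert.h>
#include <stdatomic.h>

/* Which branch of the put path a decrement would take. */
enum toy_put_action {
	TOY_PUT_NONE,        /* other references remain */
	TOY_PUT_FREE_DEVMAP, /* count dropped to 1: page is free in the 1-based scheme */
	TOY_PUT_GENERIC      /* count dropped to 0: fall back to the generic put path */
};

struct toy_devpage {
	atomic_int refcount;
};

static enum toy_put_action toy_put_devmap_page(struct toy_devpage *page)
{
	/* atomic_fetch_sub() returns the old value, so subtract 1 to get the new one. */
	int count = atomic_fetch_sub(&page->refcount, 1) - 1;

	assert(count >= 0); /* a negative count would mean an unbalanced put */
	if (count == 1)
		return TOY_PUT_FREE_DEVMAP;
	if (count == 0)
		return TOY_PUT_GENERIC;
	return TOY_PUT_NONE;
}
```

Starting from a count of 3, successive puts yield TOY_PUT_NONE, then TOY_PUT_FREE_DEVMAP, then TOY_PUT_GENERIC.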
|