License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart, and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the output of
two independent scanners (ScanCode & Windriver) producing SPDX tag:value
files created by Philippe Ombredanne. Philippe prepared the base worksheet
and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0                                              |   11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, the identifier was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results for the
*/uapi/* files were:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     270
GPL-2.0+ WITH Linux-syscall-note                     |     169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)  |      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)  |      17
LGPL-2.1+ WITH Linux-syscall-note                    |      15
GPL-1.0+ WITH Linux-syscall-note                     |      14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) |       5
LGPL-2.0+ WITH Linux-syscall-note                    |       4
LGPL-2.1 WITH Linux-syscall-note                     |       3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)           |       3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)          |       1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
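The default rule for files in which neither scanner found any license text can be sketched as a small C helper. This is only an illustration under stated assumptions: the function name `default_spdx_id` is hypothetical, and the path test is a plain substring match for "/uapi/", which is simpler than whatever matching the actual tooling used.

```c
#include <string.h>

/*
 * Hypothetical helper: pick the default SPDX identifier for a file with no
 * detected license text. Files under a uapi directory get the syscall-note
 * exception; everything else falls back to the top-level COPYING license.
 */
static const char *default_spdx_id(const char *path)
{
	/* Assumption: a simple substring test stands in for the path glob. */
	if (strstr(path, "/uapi/"))
		return "GPL-2.0 WITH Linux-syscall-note";
	return "GPL-2.0";
}
```

Files that already carry license text are not covered by this sketch; for those, the scanner results and manual review described above decide the identifier.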
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based in part on an older version of FOSSology,
so the two are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally, Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version of early this week, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
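The script itself is not shown here. As a rough sketch of the header versus source distinction it had to make, the helper below formats the tag line for a given file name. The names `has_suffix` and `format_spdx_line` are hypothetical, and only `.c` and `.h` files are handled; the two comment styles follow the kernel convention of a `//` comment in C sources and a `/* */` comment in headers.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if path ends with the given suffix, 0 otherwise. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Hypothetical helper: write the SPDX tag line for a file into buf.
 * Returns 0 on success, -1 for file types this sketch does not handle
 * or if buf is too small.
 */
static int format_spdx_line(const char *path, const char *id,
			    char *buf, size_t size)
{
	int n;

	if (has_suffix(path, ".c"))
		n = snprintf(buf, size, "// SPDX-License-Identifier: %s", id);
	else if (has_suffix(path, ".h"))
		n = snprintf(buf, size, "/* SPDX-License-Identifier: %s */", id);
	else
		return -1;

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For `mm/khugepaged.c` and the identifier `GPL-2.0` this produces the same first line that appears at the top of the file below.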
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2016-07-27 06:26:24 +08:00
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/mm.h>
#include <linux/sched.h>
2017-02-09 01:51:29 +08:00
#include <linux/sched/mm.h>
2017-02-09 01:51:30 +08:00
#include <linux/sched/coredump.h>
2016-07-27 06:26:24 +08:00
#include <linux/mmu_notifier.h>
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/mm_inline.h>
#include <linux/kthread.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/hashtable.h>
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
2016-07-27 06:26:32 +08:00
#include <linux/shmem_fs.h>
2016-07-27 06:26:24 +08:00

#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"

enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_PMD_NULL,
	SCAN_EXCEED_NONE_PTE,
2020-06-04 07:00:30 +08:00
	SCAN_EXCEED_SWAP_PTE,
	SCAN_EXCEED_SHARED_PTE,
2016-07-27 06:26:24 +08:00
	SCAN_PTE_NON_PRESENT,
2020-04-07 11:06:04 +08:00
	SCAN_PTE_UFFD_WP,
2016-07-27 06:26:24 +08:00
	SCAN_PAGE_RO,
2016-07-27 06:26:46 +08:00
	SCAN_LACK_REFERENCED_PAGE,
2016-07-27 06:26:24 +08:00
	SCAN_PAGE_NULL,
	SCAN_SCAN_ABORT,
	SCAN_PAGE_COUNT,
	SCAN_PAGE_LRU,
	SCAN_PAGE_LOCK,
	SCAN_PAGE_ANON,
	SCAN_PAGE_COMPOUND,
	SCAN_ANY_PROCESS,
	SCAN_VMA_NULL,
	SCAN_VMA_CHECK,
	SCAN_ADDRESS_RANGE,
	SCAN_SWAP_CACHE_PAGE,
	SCAN_DEL_PAGE_LRU,
	SCAN_ALLOC_HUGE_PAGE_FAIL,
	SCAN_CGROUP_CHARGE_FAIL,
2016-07-27 06:26:32 +08:00
	SCAN_TRUNCATED,
2019-09-24 06:38:00 +08:00
	SCAN_PAGE_HAS_PRIVATE,
2016-07-27 06:26:24 +08:00
};

#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>

2020-10-11 14:16:40 +08:00
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);

2016-07-27 06:26:24 +08:00
/* default scan 8*512 pte (or vmas) every 30 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static unsigned long khugepaged_sleep_expire;
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
/*
 * default collapse hugepages if there is at least one pte mapped like
 * it would have happened if the vma was large enough during page
 * fault.
 */
static unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
2020-06-04 07:00:30 +08:00
static unsigned int khugepaged_max_ptes_shared __read_mostly;
2016-07-27 06:26:24 +08:00

#define MM_SLOTS_HASH_BITS 10
static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

static struct kmem_cache *mm_slot_cache __read_mostly;

2019-09-24 06:38:30 +08:00
#define MAX_PTE_MAPPED_THP 8

2016-07-27 06:26:24 +08:00
/**
 * struct mm_slot - hash lookup from mm to mm_slot
 * @hash: hash collision list
 * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
 * @mm: the mm that this information is valid for
2020-12-15 11:12:01 +08:00
 * @nr_pte_mapped_thp: number of pte mapped THP
 * @pte_mapped_thp: address array corresponding pte mapped THP
2016-07-27 06:26:24 +08:00
 */
struct mm_slot {
	struct hlist_node hash;
	struct list_head mm_node;
	struct mm_struct *mm;
2019-09-24 06:38:30 +08:00

	/* pte-mapped THP in this mm */
	int nr_pte_mapped_thp;
	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
2016-07-27 06:26:24 +08:00
};

/**
 * struct khugepaged_scan - cursor for scanning
 * @mm_head: the head of the mm list to scan
 * @mm_slot: the current mm_slot we are scanning
 * @address: the next address inside that to be scanned
 *
 * There is only the one khugepaged_scan instance of this cursor structure.
 */
struct khugepaged_scan {
	struct list_head mm_head;
	struct mm_slot *mm_slot;
	unsigned long address;
};

static struct khugepaged_scan khugepaged_scan = {
	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
};

2016-12-01 07:54:02 +08:00
#ifdef CONFIG_SYSFS
2016-07-27 06:26:24 +08:00
static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
					 struct kobj_attribute *attr,
					 char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_scan_sleep_millisecs);
2016-07-27 06:26:24 +08:00
}

static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
					  struct kobj_attribute *attr,
					  const char *buf, size_t count)
{
2020-12-15 11:15:03 +08:00
	unsigned int msecs;
2016-07-27 06:26:24 +08:00
	int err;

2020-12-15 11:15:03 +08:00
	err = kstrtouint(buf, 10, &msecs);
	if (err)
2016-07-27 06:26:24 +08:00
		return -EINVAL;

	khugepaged_scan_sleep_millisecs = msecs;
	khugepaged_sleep_expire = 0;
	wake_up_interruptible(&khugepaged_wait);

	return count;
}
static struct kobj_attribute scan_sleep_millisecs_attr =
	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
	       scan_sleep_millisecs_store);

static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
					  struct kobj_attribute *attr,
					  char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
2016-07-27 06:26:24 +08:00
}

static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
					   struct kobj_attribute *attr,
					   const char *buf, size_t count)
{
2020-12-15 11:15:03 +08:00
	unsigned int msecs;
2016-07-27 06:26:24 +08:00
	int err;

2020-12-15 11:15:03 +08:00
	err = kstrtouint(buf, 10, &msecs);
	if (err)
2016-07-27 06:26:24 +08:00
		return -EINVAL;

	khugepaged_alloc_sleep_millisecs = msecs;
	khugepaged_sleep_expire = 0;
	wake_up_interruptible(&khugepaged_wait);

	return count;
}
static struct kobj_attribute alloc_sleep_millisecs_attr =
	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
	       alloc_sleep_millisecs_store);

static ssize_t pages_to_scan_show(struct kobject *kobj,
				  struct kobj_attribute *attr,
				  char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_pages_to_scan);
2016-07-27 06:26:24 +08:00
}
static ssize_t pages_to_scan_store(struct kobject *kobj,
				   struct kobj_attribute *attr,
				   const char *buf, size_t count)
{
2020-12-15 11:15:03 +08:00
	unsigned int pages;
2016-07-27 06:26:24 +08:00
	int err;

2020-12-15 11:15:03 +08:00
	err = kstrtouint(buf, 10, &pages);
	if (err || !pages)
2016-07-27 06:26:24 +08:00
		return -EINVAL;

	khugepaged_pages_to_scan = pages;

	return count;
}
static struct kobj_attribute pages_to_scan_attr =
	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
	       pages_to_scan_store);

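The validation done by pages_to_scan_store() can be tried outside the kernel with a small userspace analogue. This is a sketch, not kernel code: `parse_pages` is a hypothetical name, `strtoul()` stands in for `kstrtouint()`, and the sketch only accepts input that starts with a digit.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Userspace analogue of the pages_to_scan_store() checks.
 * Returns 0 and stores the value on success, -EINVAL otherwise.
 */
static int parse_pages(const char *buf, unsigned int *pages)
{
	char *end;
	unsigned long val;

	/* strtoul() would skip whitespace and accept a sign; reject both */
	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || val > UINT_MAX)
		return -EINVAL;

	/* allow one trailing newline, as written by "echo" */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* zero pages per pass is rejected, mirroring "if (err || !pages)" */
	if (val == 0)
		return -EINVAL;

	*pages = (unsigned int)val;
	return 0;
}
```

The zero check is the only difference from the two sleep-interval handlers above, which accept any value that parses as an unsigned int.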
static ssize_t pages_collapsed_show(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_pages_collapsed);
2016-07-27 06:26:24 +08:00
}
static struct kobj_attribute pages_collapsed_attr =
	__ATTR_RO(pages_collapsed);

static ssize_t full_scans_show(struct kobject *kobj,
			       struct kobj_attribute *attr,
			       char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_full_scans);
2016-07-27 06:26:24 +08:00
}
static struct kobj_attribute full_scans_attr =
	__ATTR_RO(full_scans);

static ssize_t khugepaged_defrag_show(struct kobject *kobj,
				      struct kobj_attribute *attr, char *buf)
{
	return single_hugepage_flag_show(kobj, attr, buf,
2020-12-15 11:14:42 +08:00
					 TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
2016-07-27 06:26:24 +08:00
}
static ssize_t khugepaged_defrag_store(struct kobject *kobj,
				       struct kobj_attribute *attr,
				       const char *buf, size_t count)
{
	return single_hugepage_flag_store(kobj, attr, buf, count,
					  TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
}
static struct kobj_attribute khugepaged_defrag_attr =
	__ATTR(defrag, 0644, khugepaged_defrag_show,
	       khugepaged_defrag_store);

/*
 * max_ptes_none controls if khugepaged should collapse hugepages over
 * any unmapped ptes in turn potentially increasing the memory
 * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
 * reduce the available free memory in the system as it
 * runs. Increasing max_ptes_none will instead potentially reduce the
 * free memory in the system during the khugepaged scan.
 */
static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
					     struct kobj_attribute *attr,
					     char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_none);
2016-07-27 06:26:24 +08:00
}
static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
					      struct kobj_attribute *attr,
					      const char *buf, size_t count)
{
	int err;
	unsigned long max_ptes_none;

	err = kstrtoul(buf, 10, &max_ptes_none);
	if (err || max_ptes_none > HPAGE_PMD_NR-1)
		return -EINVAL;

	khugepaged_max_ptes_none = max_ptes_none;

	return count;
}
static struct kobj_attribute khugepaged_max_ptes_none_attr =
	__ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
	       khugepaged_max_ptes_none_store);

static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
					     struct kobj_attribute *attr,
					     char *buf)
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_swap);
2016-07-27 06:26:24 +08:00
}

static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
					      struct kobj_attribute *attr,
					      const char *buf, size_t count)
{
	int err;
	unsigned long max_ptes_swap;

	err = kstrtoul(buf, 10, &max_ptes_swap);
	if (err || max_ptes_swap > HPAGE_PMD_NR-1)
		return -EINVAL;

	khugepaged_max_ptes_swap = max_ptes_swap;

	return count;
}

static struct kobj_attribute khugepaged_max_ptes_swap_attr =
	__ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
	       khugepaged_max_ptes_swap_store);

2020-06-04 07:00:30 +08:00
static ssize_t khugepaged_max_ptes_shared_show(struct kobject *kobj,
2020-12-15 11:14:42 +08:00
					       struct kobj_attribute *attr,
					       char *buf)
2020-06-04 07:00:30 +08:00
{
2020-12-15 11:14:42 +08:00
	return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_shared);
2020-06-04 07:00:30 +08:00
}

static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj,
						struct kobj_attribute *attr,
						const char *buf, size_t count)
{
	int err;
	unsigned long max_ptes_shared;

	err = kstrtoul(buf, 10, &max_ptes_shared);
	if (err || max_ptes_shared > HPAGE_PMD_NR-1)
		return -EINVAL;

	khugepaged_max_ptes_shared = max_ptes_shared;

	return count;
}

static struct kobj_attribute khugepaged_max_ptes_shared_attr =
	__ATTR(max_ptes_shared, 0644, khugepaged_max_ptes_shared_show,
	       khugepaged_max_ptes_shared_store);

2016-07-27 06:26:24 +08:00
static struct attribute *khugepaged_attr[] = {
	&khugepaged_defrag_attr.attr,
	&khugepaged_max_ptes_none_attr.attr,
2020-06-04 07:00:30 +08:00
	&khugepaged_max_ptes_swap_attr.attr,
	&khugepaged_max_ptes_shared_attr.attr,
2016-07-27 06:26:24 +08:00
	&pages_to_scan_attr.attr,
	&pages_collapsed_attr.attr,
	&full_scans_attr.attr,
	&scan_sleep_millisecs_attr.attr,
	&alloc_sleep_millisecs_attr.attr,
	NULL,
};

struct attribute_group khugepaged_attr_group = {
	.attrs = khugepaged_attr,
	.name = "khugepaged",
};
2016-12-01 07:54:02 +08:00
#endif /* CONFIG_SYSFS */
2016-07-27 06:26:24 +08:00

int hugepage_madvise(struct vm_area_struct *vma,
		     unsigned long *vm_flags, int advice)
{
	switch (advice) {
	case MADV_HUGEPAGE:
#ifdef CONFIG_S390
		/*
		 * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
		 * can't handle this properly after s390_enable_sie, so we simply
		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
		 */
		if (mm_has_pgste(vma->vm_mm))
			return 0;
#endif
		*vm_flags &= ~VM_NOHUGEPAGE;
		*vm_flags |= VM_HUGEPAGE;
		/*
		 * If the vma become good for khugepaged to scan,
		 * register it here without waiting a page fault that
		 * may not happen any time soon.
		 */
		if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
				khugepaged_enter_vma_merge(vma, *vm_flags))
			return -ENOMEM;
		break;
	case MADV_NOHUGEPAGE:
		*vm_flags &= ~VM_HUGEPAGE;
		*vm_flags |= VM_NOHUGEPAGE;
		/*
		 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
		 * this vma even if we leave the mm registered in khugepaged if
		 * it got registered before VM_NOHUGEPAGE was set.
		 */
		break;
	}

	return 0;
}

int __init khugepaged_init(void)
{
	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
					  sizeof(struct mm_slot),
					  __alignof__(struct mm_slot), 0, NULL);
	if (!mm_slot_cache)
		return -ENOMEM;

	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
2020-06-04 07:00:30 +08:00
	khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
2016-07-27 06:26:24 +08:00

	return 0;
}

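As a worked example of the defaults set in khugepaged_init(), the sketch below evaluates the four expressions for one concrete configuration. The value 512 for HPAGE_PMD_NR is an assumption (4 KiB base pages with 2 MiB PMD-sized huge pages); other architectures and page sizes give other numbers. The struct and function names are hypothetical.

```c
/* Hypothetical container for the four tunables initialised above. */
struct khugepaged_defaults {
	unsigned int pages_to_scan;
	unsigned int max_ptes_none;
	unsigned int max_ptes_swap;
	unsigned int max_ptes_shared;
};

/* Evaluate the khugepaged_init() expressions for a given HPAGE_PMD_NR. */
static struct khugepaged_defaults example_defaults(unsigned int hpage_pmd_nr)
{
	struct khugepaged_defaults d = {
		.pages_to_scan   = hpage_pmd_nr * 8,	/* 512 -> 4096 */
		.max_ptes_none   = hpage_pmd_nr - 1,	/* 512 -> 511  */
		.max_ptes_swap   = hpage_pmd_nr / 8,	/* 512 -> 64   */
		.max_ptes_shared = hpage_pmd_nr / 2,	/* 512 -> 256  */
	};

	return d;
}
```

With these numbers a collapse is attempted even if 511 of the 512 PTEs are unmapped, which matches the "at least one pte mapped" comment next to khugepaged_max_ptes_none.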
void __init khugepaged_destroy(void)
{
	kmem_cache_destroy(mm_slot_cache);
}

static inline struct mm_slot *alloc_mm_slot(void)
{
	if (!mm_slot_cache)	/* initialization failed */
		return NULL;
	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
}

static inline void free_mm_slot(struct mm_slot *mm_slot)
{
	kmem_cache_free(mm_slot_cache, mm_slot);
}

static struct mm_slot *get_mm_slot(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;

	hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
		if (mm == mm_slot->mm)
			return mm_slot;

	return NULL;
}

static void insert_to_mm_slots_hash(struct mm_struct *mm,
				    struct mm_slot *mm_slot)
{
	mm_slot->mm = mm;
	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
}

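The pointer-keyed lookup done by get_mm_slot() and insert_to_mm_slots_hash() can be illustrated with a minimal userspace hash table. This is a sketch under stated assumptions: all names are hypothetical, the hash is a simple shift-and-mask rather than the kernel's hash function, and there is no locking, whereas the kernel code runs under khugepaged_mm_lock.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_HASH_BITS 10
#define SLOT_HASH_SIZE (1u << SLOT_HASH_BITS)

struct slot {
	struct slot *next;	/* collision chain within one bucket */
	const void *key;	/* stands in for the mm pointer */
};

static struct slot *slot_table[SLOT_HASH_SIZE];

/* Assumption: drop the low alignment bits, then mask to the table size. */
static unsigned int slot_hash(const void *key)
{
	return (unsigned int)(((uintptr_t)key >> 4) & (SLOT_HASH_SIZE - 1));
}

/* Record the key in the slot and push the slot onto its bucket's chain. */
static void slot_insert(struct slot *s, const void *key)
{
	unsigned int b = slot_hash(key);

	s->key = key;
	s->next = slot_table[b];
	slot_table[b] = s;
}

/* Walk only the bucket the key hashes to, comparing pointers. */
static struct slot *slot_lookup(const void *key)
{
	struct slot *s;

	for (s = slot_table[slot_hash(key)]; s; s = s->next)
		if (s->key == key)
			return s;
	return NULL;
}
```

As in the kernel code, the key comparison inside the loop is what resolves collisions: several pointers may share a bucket, so the hash alone is never taken as a match.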
static inline int khugepaged_test_exit(struct mm_struct *mm)
{
2020-10-16 11:13:00 +08:00
	return atomic_read(&mm->mm_users) == 0;
2016-07-27 06:26:24 +08:00
}

mm: thp: pass correct vm_flags to hugepage_vma_check()
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
hugepage_vma_check(). The argument vm_flags contains the latest value.
Therefore, it is necessary to pass this vm_flags into
hugepage_vma_check().
With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
put memory in huge pages. Here is an example of failed madvise():
/* mount /dev/shm with huge=advise:
* mount -o remount,huge=advise /dev/shm */
/* create file /dev/shm/huge */
#define HUGE_FILE "/dev/shm/huge"
fd = open(HUGE_FILE, O_RDONLY);
ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);
madvise() will return 0, but this memory region is never backed by huge
pages (check ShmemHugePages in /proc/meminfo).
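A self-contained variant of the snippet above, written as a sketch with error handling added. The function name `try_madvise_hugepage` is hypothetical, and the caller supplies the path and length; reproducing the report still needs /dev/shm remounted with huge=advise and an existing file, as described in the comments of the original snippet.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* Linux value, for older libc headers */
#endif

/*
 * Map the file read-only and private, then ask for huge pages.
 * Returns 0 if madvise() succeeds, or -1 if open(), mmap() or
 * madvise() fails.
 */
static int try_madvise_hugepage(const char *path, size_t len)
{
	void *ptr;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the file */
	if (ptr == MAP_FAILED)
		return -1;

	ret = madvise(ptr, len, MADV_HUGEPAGE);
	munmap(ptr, len);
	return ret;
}
```

A return value of 0 only says that the advice was accepted. Whether huge pages are actually used has to be checked separately, which is exactly the gap this commit describes.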
Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:47:00 +08:00
static bool hugepage_vma_check(struct vm_area_struct *vma,
			       unsigned long vm_flags)
2018-08-18 06:45:26 +08:00
{
2021-07-01 09:47:50 +08:00
	if (!transhuge_vma_enabled(vma, vm_flags))
2018-08-18 06:45:26 +08:00
		return false;
2019-09-24 06:38:00 +08:00

2021-10-29 05:36:30 +08:00
	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
				vma->vm_pgoff, HPAGE_PMD_NR))
		return false;

2021-02-26 09:16:25 +08:00
	/* Enabled via shmem mount options or sysfs settings. */
2021-10-29 05:36:30 +08:00
	if (shmem_file(vma->vm_file))
		return shmem_huge_enabled(vma);
2021-02-26 09:16:25 +08:00

	/* THP settings require madvise. */
	if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
		return false;

2021-10-29 05:36:30 +08:00
	/* Only regular file is valid */
2021-02-26 09:16:25 +08:00
	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE. This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY). Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open(). Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().
Systems that make heavy use of shared libraries (e.g. Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.
This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()). It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.
Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.
The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout. The dynamic linker must
map these files' segments at a hugepage-size-aligned vma for the mapping
to be backed with THPs.
Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.
4KB Pages:
==========
count event_name # count / runtime
598,995,035,942 cpu-cycles # 1.800861 GHz
81,195,620,851 raw-stall-frontend # 244.112 M/sec
347,754,466,597 iTLB-loads # 1.046 G/sec
2,970,248,900 iTLB-load-misses # 0.854122% miss rate
Total test time: 332.854998 seconds.
2MB Pages:
==========
count event_name # count / runtime
592,872,663,047 cpu-cycles # 1.800358 GHz
76,485,624,143 raw-stall-frontend # 232.261 M/sec
350,478,413,710 iTLB-loads # 1.064 G/sec
803,233,322 iTLB-load-misses # 0.229182% miss rate
Total test time: 329.826087 seconds
A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
/apex/com.android.art/lib64/libart.so
FilePmdMapped: 4096 kB
/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped: 2048 kB
Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:51:32 +08:00
	    (vm_flags & VM_EXEC)) {
2021-10-29 05:36:30 +08:00
		struct inode *inode = vma->vm_file->f_inode;

		return !inode_is_open_for_write(inode) &&
			S_ISREG(inode->i_mode);
2021-02-26 09:16:25 +08:00
	}

2018-08-18 06:45:26 +08:00
	if (!vma->anon_vma || vma->vm_ops)
		return false;
2020-04-02 12:07:52 +08:00
	if (vma_is_temporary_stack(vma))
2018-08-18 06:45:26 +08:00
		return false;
2018-08-18 06:47:00 +08:00
	return !(vm_flags & VM_NO_KHUGEPAGED);
2018-08-18 06:45:26 +08:00
}

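The file-mapping alignment test in hugepage_vma_check() can be checked with concrete numbers. The sketch below mirrors the IS_ALIGNED() expression in userspace; the constants assume 4 KiB pages and 512 base pages per huge page, and the names are hypothetical.

```c
#include <stdbool.h>

/* Assumptions for this example: 4 KiB pages, 512 base pages per huge page. */
#define EX_PAGE_SHIFT	12
#define EX_HPAGE_PMD_NR	512ul

/*
 * The virtual page number minus the file page offset must be a multiple of
 * the huge page size in pages. Otherwise a huge-page-aligned file offset
 * can never land on a huge-page-aligned virtual address in this vma.
 */
static bool file_vma_thp_aligned(unsigned long vm_start, unsigned long vm_pgoff)
{
	return (((vm_start >> EX_PAGE_SHIFT) - vm_pgoff) % EX_HPAGE_PMD_NR) == 0;
}
```

For example, a vma starting at 0x201000 that maps the file from page offset 1 passes the test (513 - 1 = 512), while the same start address with page offset 0 does not (513 is not a multiple of 512).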
2016-07-27 06:26:24 +08:00
int __khugepaged_enter(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;
	int wakeup;

	mm_slot = alloc_mm_slot();
	if (!mm_slot)
		return -ENOMEM;

	/* __khugepaged_exit() must not run from under us */
2021-05-05 09:33:43 +08:00
	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
2016-07-27 06:26:24 +08:00
	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
		free_mm_slot(mm_slot);
		return 0;
	}

	spin_lock(&khugepaged_mm_lock);
	insert_to_mm_slots_hash(mm, mm_slot);
	/*
	 * Insert just behind the scanning cursor, to let the area settle
	 * down a little.
	 */
	wakeup = list_empty(&khugepaged_scan.mm_head);
	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
	spin_unlock(&khugepaged_mm_lock);

2017-02-28 06:30:07 +08:00
	mmgrab(mm);
2016-07-27 06:26:24 +08:00
	if (wakeup)
		wake_up_interruptible(&khugepaged_wait);

	return 0;
}

int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
|
|
|
|
unsigned long vm_flags)
|
|
|
|
{
|
|
|
|
unsigned long hstart, hend;
|
2018-08-18 06:45:26 +08:00
|
|
|
|
|
|
|
/*
|
2019-09-24 06:38:00 +08:00
|
|
|
* khugepaged only supports read-only files for non-shmem files.
|
|
|
|
* khugepaged does not yet work on special mappings. And
|
|
|
|
* file-private shmem THP is not supported.
|
2018-08-18 06:45:26 +08:00
|
|
|
*/
|
2018-08-18 06:47:00 +08:00
|
|
|
if (!hugepage_vma_check(vma, vm_flags))
|
2016-07-27 06:26:24 +08:00
|
|
|
return 0;
|
2018-08-18 06:45:26 +08:00
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
|
|
|
|
hend = vma->vm_end & HPAGE_PMD_MASK;
|
|
|
|
if (hstart < hend)
|
|
|
|
return khugepaged_enter(vma, vm_flags);
|
|
|
|
return 0;
|
|
|
|
}
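The hstart/hend arithmetic in khugepaged_enter_vma_merge() rounds the start of the VMA up and its end down to huge page boundaries, and only registers the mm when at least one aligned huge page fits. A user-space sketch of that arithmetic is shown below. It assumes 2 MiB PMD-sized huge pages; the `SKETCH_` macros and `sketch_huge_range` are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized huge pages. */
#define SKETCH_HPAGE_PMD_SIZE ((uintptr_t)1 << 21)
#define SKETCH_HPAGE_PMD_MASK (~(SKETCH_HPAGE_PMD_SIZE - 1))

/*
 * Round vm_start up and vm_end down to huge page boundaries.
 * Returns true when at least one aligned huge page fits in the range.
 */
static bool sketch_huge_range(uintptr_t vm_start, uintptr_t vm_end,
			      uintptr_t *hstart, uintptr_t *hend)
{
	assert(vm_start <= vm_end);

	/* Adding (size - 1) before masking rounds up instead of down. */
	*hstart = (vm_start + ~SKETCH_HPAGE_PMD_MASK) & SKETCH_HPAGE_PMD_MASK;
	*hend = vm_end & SKETCH_HPAGE_PMD_MASK;
	return *hstart < *hend;
}
```

For a range starting at 0x201000, the rounded start is 0x400000, so a range ending at 0x600000 holds one aligned huge page while a range ending at 0x3ff000 holds none.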
|
|
|
|
|
|
|
|
void __khugepaged_exit(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
struct mm_slot *mm_slot;
|
|
|
|
int free = 0;
|
|
|
|
|
|
|
|
spin_lock(&khugepaged_mm_lock);
|
|
|
|
mm_slot = get_mm_slot(mm);
|
|
|
|
if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
|
|
|
|
hash_del(&mm_slot->hash);
|
|
|
|
list_del(&mm_slot->mm_node);
|
|
|
|
free = 1;
|
|
|
|
}
|
|
|
|
spin_unlock(&khugepaged_mm_lock);
|
|
|
|
|
|
|
|
if (free) {
|
|
|
|
clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
|
|
|
|
free_mm_slot(mm_slot);
|
|
|
|
mmdrop(mm);
|
|
|
|
} else if (mm_slot) {
|
|
|
|
/*
|
|
|
|
* This is required to serialize against
|
|
|
|
* khugepaged_test_exit() (which is guaranteed to run
|
|
|
|
* under mmap sem read mode). Stop here (after we
|
|
|
|
* return all pagetables will be destroyed) until
|
|
|
|
* khugepaged has finished working on the pagetables
|
2020-06-09 12:33:54 +08:00
|
|
|
* under the mmap_lock.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
|
|
|
mmap_write_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void release_pte_page(struct page *page)
|
|
|
|
{
|
2020-06-04 07:00:23 +08:00
|
|
|
mod_node_page_state(page_pgdat(page),
|
|
|
|
NR_ISOLATED_ANON + page_is_file_lru(page),
|
|
|
|
-compound_nr(page));
|
2016-07-27 06:26:24 +08:00
|
|
|
unlock_page(page);
|
|
|
|
putback_lru_page(page);
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
static void release_pte_pages(pte_t *pte, pte_t *_pte,
|
|
|
|
struct list_head *compound_pagelist)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
2020-06-04 07:00:23 +08:00
|
|
|
struct page *page, *tmp;
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
while (--_pte >= pte) {
|
|
|
|
pte_t pteval = *_pte;
|
2020-06-04 07:00:23 +08:00
|
|
|
|
|
|
|
page = pte_page(pteval);
|
|
|
|
if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
|
|
|
|
!PageCompound(page))
|
|
|
|
release_pte_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
|
|
|
|
list_del(&page->lru);
|
|
|
|
release_pte_page(page);
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:00:20 +08:00
|
|
|
static bool is_refcount_suitable(struct page *page)
{
	int expected_refcount;

	expected_refcount = total_mapcount(page);
	if (PageSwapCache(page))
		expected_refcount += compound_nr(page);

	return page_count(page) == expected_refcount;
}
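The check in is_refcount_suitable() compares the page's reference count with the number of references that can be accounted for. Any surplus is treated as a possible external pin. The toy model below restates that comparison in user space. `struct sketch_page` and its fields are hypothetical stand-ins for the values the kernel helpers would return, not the real struct page layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the values read from struct page. */
struct sketch_page {
	int refcount;       /* what page_count() would return */
	int mapcount;       /* what total_mapcount() would return */
	int nr_subpages;    /* what compound_nr() would return */
	bool in_swap_cache; /* what PageSwapCache() would return */
};

static bool sketch_refcount_suitable(const struct sketch_page *p)
{
	assert(p != NULL);

	int expected = p->mapcount;

	/* As in the function above, swap cache adds compound_nr() references. */
	if (p->in_swap_cache)
		expected += p->nr_subpages;

	/* Any reference beyond the expected ones may be an external pin. */
	return p->refcount == expected;
}
```

A page mapped once with a reference count of 1 passes. The same page with a count of 2 fails unless it is also in the swap cache, which accounts for the extra reference.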
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
2020-06-04 07:00:23 +08:00
|
|
|
pte_t *pte,
|
|
|
|
struct list_head *compound_pagelist)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
|
|
|
struct page *page = NULL;
|
|
|
|
pte_t *_pte;
|
2020-06-04 07:00:30 +08:00
|
|
|
int none_or_zero = 0, shared = 0, result = 0, referenced = 0;
|
2016-07-27 06:26:46 +08:00
|
|
|
bool writable = false;
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
|
|
|
|
_pte++, address += PAGE_SIZE) {
|
|
|
|
pte_t pteval = *_pte;
|
|
|
|
if (pte_none(pteval) || (pte_present(pteval) &&
|
|
|
|
is_zero_pfn(pte_pfn(pteval)))) {
|
|
|
|
if (!userfaultfd_armed(vma) &&
|
|
|
|
++none_or_zero <= khugepaged_max_ptes_none) {
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
result = SCAN_EXCEED_NONE_PTE;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!pte_present(pteval)) {
|
|
|
|
result = SCAN_PTE_NON_PRESENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
page = vm_normal_page(vma, address, pteval);
|
|
|
|
if (unlikely(!page)) {
|
|
|
|
result = SCAN_PAGE_NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageAnon(page), page);
|
|
|
|
|
2020-06-04 07:00:30 +08:00
|
|
|
if (page_mapcount(page) > 1 &&
|
|
|
|
++shared > khugepaged_max_ptes_shared) {
|
|
|
|
result = SCAN_EXCEED_SHARED_PTE;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-03-23 07:17:28 +08:00
|
|
|
if (PageCompound(page)) {
|
2020-06-04 07:00:23 +08:00
|
|
|
struct page *p;
|
|
|
|
page = compound_head(page);
|
2018-03-23 07:17:28 +08:00
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
/*
|
|
|
|
* Check if we have dealt with the compound page
|
|
|
|
* already
|
|
|
|
*/
|
|
|
|
list_for_each_entry(p, compound_pagelist, lru) {
|
|
|
|
if (page == p)
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We can do it before isolate_lru_page because the
|
|
|
|
* page can't be freed from under us. NOTE: PG_lock
|
|
|
|
* is needed to serialize against split_huge_page
|
|
|
|
* when invoked from the VM.
|
|
|
|
*/
|
|
|
|
if (!trylock_page(page)) {
|
|
|
|
result = SCAN_PAGE_LOCK;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-06-04 07:00:20 +08:00
|
|
|
* Check if the page has any GUP (or other external) pins.
|
|
|
|
*
|
|
|
|
* The page table that maps the page has been already unlinked
|
|
|
|
* from the page table tree and this process cannot get
|
2021-05-07 09:06:47 +08:00
|
|
|
* an additional pin on the page.
|
2020-06-04 07:00:20 +08:00
|
|
|
*
|
|
|
|
* New pins can come later if the page is shared across fork,
|
|
|
|
* but not from this process. The other process cannot write to
|
|
|
|
* the page, only trigger CoW.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
2020-06-04 07:00:20 +08:00
|
|
|
if (!is_refcount_suitable(page)) {
|
2016-07-27 06:26:24 +08:00
|
|
|
unlock_page(page);
|
|
|
|
result = SCAN_PAGE_COUNT;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-06-04 07:00:23 +08:00
|
|
|
if (!pte_write(pteval) && PageSwapCache(page) &&
|
|
|
|
!reuse_swap_page(page, NULL)) {
|
2016-07-27 06:26:24 +08:00
|
|
|
/*
|
2020-06-04 07:00:23 +08:00
|
|
|
* Page is in the swap cache and cannot be re-used.
|
|
|
|
* It cannot be collapsed into a THP.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
2020-06-04 07:00:23 +08:00
|
|
|
unlock_page(page);
|
|
|
|
result = SCAN_SWAP_CACHE_PAGE;
|
|
|
|
goto out;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Isolate the page to avoid collapsing a hugepage
|
|
|
|
* currently in use by the VM.
|
|
|
|
*/
|
|
|
|
if (isolate_lru_page(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
result = SCAN_DEL_PAGE_LRU;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-06-04 07:00:23 +08:00
|
|
|
mod_node_page_state(page_pgdat(page),
|
|
|
|
NR_ISOLATED_ANON + page_is_file_lru(page),
|
|
|
|
compound_nr(page));
|
2016-07-27 06:26:24 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(PageLRU(page), page);
|
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
if (PageCompound(page))
|
|
|
|
list_add_tail(&page->lru, compound_pagelist);
|
|
|
|
next:
|
2016-07-27 06:26:46 +08:00
|
|
|
/* There should be enough young pte to collapse the page */
|
2016-07-27 06:26:24 +08:00
|
|
|
if (pte_young(pteval) ||
|
|
|
|
page_is_young(page) || PageReferenced(page) ||
|
|
|
|
mmu_notifier_test_young(vma->vm_mm, address))
|
2016-07-27 06:26:46 +08:00
|
|
|
referenced++;
|
2020-06-04 07:00:23 +08:00
|
|
|
|
|
|
|
if (pte_write(pteval))
|
|
|
|
writable = true;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
2021-05-05 09:33:46 +08:00
|
|
|
|
|
|
|
if (unlikely(!writable)) {
|
2016-07-27 06:26:24 +08:00
|
|
|
result = SCAN_PAGE_RO;
|
2021-05-05 09:33:46 +08:00
|
|
|
} else if (unlikely(!referenced)) {
|
|
|
|
result = SCAN_LACK_REFERENCED_PAGE;
|
|
|
|
} else {
|
|
|
|
result = SCAN_SUCCEED;
|
|
|
|
trace_mm_collapse_huge_page_isolate(page, none_or_zero,
|
|
|
|
referenced, writable, result);
|
|
|
|
return 1;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
out:
|
2020-06-04 07:00:23 +08:00
|
|
|
release_pte_pages(pte, _pte, compound_pagelist);
|
2016-07-27 06:26:24 +08:00
|
|
|
trace_mm_collapse_huge_page_isolate(page, none_or_zero,
|
|
|
|
referenced, writable, result);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
2020-06-04 07:00:23 +08:00
|
|
|
spinlock_t *ptl,
|
|
|
|
struct list_head *compound_pagelist)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
2020-06-04 07:00:23 +08:00
|
|
|
struct page *src_page, *tmp;
|
2016-07-27 06:26:24 +08:00
|
|
|
pte_t *_pte;
|
2017-05-13 06:47:03 +08:00
|
|
|
for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
|
|
|
|
_pte++, page++, address += PAGE_SIZE) {
|
2016-07-27 06:26:24 +08:00
|
|
|
pte_t pteval = *_pte;
|
|
|
|
|
|
|
|
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
|
|
|
|
clear_user_highpage(page, address);
|
|
|
|
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
|
|
|
|
if (is_zero_pfn(pte_pfn(pteval))) {
|
|
|
|
/*
|
|
|
|
* ptl mostly unnecessary.
|
|
|
|
*/
|
|
|
|
spin_lock(ptl);
|
|
|
|
/*
|
|
|
|
* paravirt calls inside pte_clear here are
|
|
|
|
* superfluous.
|
|
|
|
*/
|
|
|
|
pte_clear(vma->vm_mm, address, _pte);
|
|
|
|
spin_unlock(ptl);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
src_page = pte_page(pteval);
|
|
|
|
copy_user_highpage(page, src_page, address, vma);
|
2020-06-04 07:00:23 +08:00
|
|
|
if (!PageCompound(src_page))
|
|
|
|
release_pte_page(src_page);
|
2016-07-27 06:26:24 +08:00
|
|
|
/*
|
|
|
|
* ptl mostly unnecessary, but preempt has to
|
|
|
|
* be disabled to update the per-cpu stats
|
|
|
|
* inside page_remove_rmap().
|
|
|
|
*/
|
|
|
|
spin_lock(ptl);
|
|
|
|
/*
|
|
|
|
* paravirt calls inside pte_clear here are
|
|
|
|
* superfluous.
|
|
|
|
*/
|
|
|
|
pte_clear(vma->vm_mm, address, _pte);
|
|
|
|
page_remove_rmap(src_page, false);
|
|
|
|
spin_unlock(ptl);
|
|
|
|
free_page_and_swap_cache(src_page);
|
|
|
|
}
|
|
|
|
}
|
2020-06-04 07:00:23 +08:00
|
|
|
|
|
|
|
list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
|
|
|
|
list_del(&src_page->lru);
|
|
|
|
release_pte_page(src_page);
|
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void khugepaged_alloc_sleep(void)
{
	DEFINE_WAIT(wait);

	add_wait_queue(&khugepaged_wait, &wait);
	freezable_schedule_timeout_interruptible(
		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
	remove_wait_queue(&khugepaged_wait, &wait);
}
|
|
|
|
|
|
|
|
static int khugepaged_node_load[MAX_NUMNODES];
|
|
|
|
|
|
|
|
static bool khugepaged_scan_abort(int nid)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/*
|
2016-07-29 06:46:32 +08:00
|
|
|
* If node_reclaim_mode is disabled, then no extra effort is made to
|
2016-07-27 06:26:24 +08:00
|
|
|
* allocate memory locally.
|
|
|
|
*/
|
2021-05-05 09:36:04 +08:00
|
|
|
if (!node_reclaim_enabled())
|
2016-07-27 06:26:24 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/* If there is a count for this node already, it must be acceptable */
|
|
|
|
if (khugepaged_node_load[nid])
|
|
|
|
return false;
|
|
|
|
|
|
|
|
for (i = 0; i < MAX_NUMNODES; i++) {
|
|
|
|
if (!khugepaged_node_load[i])
|
|
|
|
continue;
|
2019-08-09 03:53:01 +08:00
|
|
|
if (node_distance(nid, i) > node_reclaim_distance)
|
2016-07-27 06:26:24 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
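khugepaged_scan_abort() stops a scan when a page sits on a node that is too far from the nodes already counted. The user-space sketch below mirrors that decision. The three-node distance table, the threshold passed by the caller, and all `sketch_` names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NODES 3

/* Hypothetical NUMA distance table (row = from node, column = to node). */
static const int sketch_distance[SKETCH_NODES][SKETCH_NODES] = {
	{ 10, 20, 40 },
	{ 20, 10, 40 },
	{ 40, 40, 10 },
};

static bool sketch_scan_abort(const int load[SKETCH_NODES], int nid,
			      bool reclaim_enabled, int reclaim_distance)
{
	assert(nid >= 0 && nid < SKETCH_NODES);

	/* Without node reclaim, no extra effort is made to stay local. */
	if (!reclaim_enabled)
		return false;

	/* A node that already has a count has been accepted before. */
	if (load[nid])
		return false;

	/* Abort if any node already counted is too far from this one. */
	for (int i = 0; i < SKETCH_NODES; i++) {
		if (!load[i])
			continue;
		if (sketch_distance[nid][i] > reclaim_distance)
			return true;
	}
	return false;
}
```

With pages already counted on node 0 and an assumed threshold of 30, a page on node 1 (distance 20) is accepted while a page on node 2 (distance 40) aborts the scan.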
|
|
|
|
|
|
|
|
/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
|
|
|
|
static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
|
|
|
|
{
|
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.
We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
The patch effectively changes the current GFP_TRANSHUGE users as
follows:
* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")
* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.
As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapse attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.
* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.
* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).
[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-29 06:49:25 +08:00
|
|
|
return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
static int khugepaged_find_target_node(void)
{
	static int last_khugepaged_target_node = NUMA_NO_NODE;
	int nid, target_node = 0, max_value = 0;

	/* find first node with max normal pages hit */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (khugepaged_node_load[nid] > max_value) {
			max_value = khugepaged_node_load[nid];
			target_node = nid;
		}

	/* do some balance if several nodes have the same hit record */
	if (target_node <= last_khugepaged_target_node)
		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
				nid++)
			if (max_value == khugepaged_node_load[nid]) {
				target_node = nid;
				break;
			}

	last_khugepaged_target_node = target_node;
	return target_node;
}
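The tie-breaking in khugepaged_find_target_node() is easier to follow with concrete numbers. The user-space sketch below copies the selection logic onto a four-node load array. The `sketch_` names and the node count are assumptions for illustration, and the static variable plays the role of last_khugepaged_target_node.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NODES 4
#define SKETCH_NO_NODE (-1)

/* Persists across calls, like the static variable in the kernel function. */
static int sketch_last_target = SKETCH_NO_NODE;

static int sketch_find_target_node(const int load[SKETCH_NODES])
{
	assert(load != NULL);

	int nid, target = 0, max_value = 0;

	/* First node with the highest load. */
	for (nid = 0; nid < SKETCH_NODES; nid++)
		if (load[nid] > max_value) {
			max_value = load[nid];
			target = nid;
		}

	/* On a tie, prefer the next equally loaded node after the last pick. */
	if (target <= sketch_last_target)
		for (nid = sketch_last_target + 1; nid < SKETCH_NODES; nid++)
			if (load[nid] == max_value) {
				target = nid;
				break;
			}

	sketch_last_target = target;
	return target;
}
```

With loads { 5, 0, 5, 5 }, successive calls pick nodes 0, 2, 3 and then wrap back to 0, so equally loaded nodes take turns.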
|
|
|
|
|
|
|
|
static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
{
	if (IS_ERR(*hpage)) {
		if (!*wait)
			return false;

		*wait = false;
		*hpage = NULL;
		khugepaged_alloc_sleep();
	} else if (*hpage) {
		put_page(*hpage);
		*hpage = NULL;
	}

	return true;
}
|
|
|
|
|
|
|
|
static struct page *
|
2016-07-27 06:26:26 +08:00
|
|
|
khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
|
|
|
VM_BUG_ON_PAGE(*hpage, *hpage);
|
|
|
|
|
|
|
|
*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
|
|
|
|
if (unlikely(!*hpage)) {
|
|
|
|
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
|
|
|
|
*hpage = ERR_PTR(-ENOMEM);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
prep_transhuge_page(*hpage);
|
|
|
|
count_vm_event(THP_COLLAPSE_ALLOC);
|
|
|
|
return *hpage;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static int khugepaged_find_target_node(void)
{
	return 0;
}

static inline struct page *alloc_khugepaged_hugepage(void)
{
	struct page *page;

	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
			   HPAGE_PMD_ORDER);
	if (page)
		prep_transhuge_page(page);
	return page;
}

static struct page *khugepaged_alloc_hugepage(bool *wait)
{
	struct page *hpage;

	do {
		hpage = alloc_khugepaged_hugepage();
		if (!hpage) {
			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
			if (!*wait)
				return NULL;

			*wait = false;
			khugepaged_alloc_sleep();
		} else
			count_vm_event(THP_COLLAPSE_ALLOC);
	} while (unlikely(!hpage) && likely(khugepaged_enabled()));

	return hpage;
}
|
|
|
|
|
|
|
|
static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
|
|
|
|
{
|
mm/khugepaged: fix filemap page_to_pgoff(page) != offset
There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.
Suren hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA not set, and traced it to how khugepaged without NUMA
reuses the same huge page after collapse_file() has failed (whereas
NUMA targets its allocation to the respective node each time).
Most of us were usually testing with CONFIG_NUMA=y kernels.
collapse_file(old start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=old offset
  new_page->mapping = mapping
  xas_store(&xas, new_page)

                          filemap_fault
                            page = find_get_page(mapping, offset)
                            // if offset falls inside hpage then
                            // compound_head(page) == hpage
                            lock_page_maybe_drop_mmap()
                              __lock_page(page)

  // collapse fails
  xas_store(&xas, old page)
  new_page->mapping = NULL
  unlock_page(new_page)

collapse_file(new start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=new offset
  new_page->mapping = mapping // mapping becomes valid again

                            // since compound_head(page) == hpage
                            // page_to_pgoff(page) got changed
                            VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above. But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.
Instead, non-NUMA khugepaged_prealloc_page() releases the old page
if anyone else has a reference to it (1% of cases when I tested).
Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v4.9+
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-10 11:07:59 +08:00
|
|
|
/*
|
|
|
|
* If the hpage allocated earlier was briefly exposed in page cache
|
|
|
|
* before collapse_file() failed, it is possible that racing lookups
|
|
|
|
* have not yet completed, and would then be unpleasantly surprised by
|
|
|
|
* finding the hpage reused for the same mapping at a different offset.
|
|
|
|
* Just release the previous allocation if there is any danger of that.
|
|
|
|
*/
|
|
|
|
if (*hpage && page_count(*hpage) > 1) {
|
|
|
|
put_page(*hpage);
|
|
|
|
*hpage = NULL;
|
|
|
|
}
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
if (!*hpage)
|
|
|
|
*hpage = khugepaged_alloc_hugepage(wait);
|
|
|
|
|
|
|
|
if (unlikely(!*hpage))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct page *
|
2016-07-27 06:26:26 +08:00
|
|
|
khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
|
|
|
VM_BUG_ON(!*hpage);
|
|
|
|
|
|
|
|
return *hpage;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2020-06-09 12:33:54 +08:00
|
|
|
* If mmap_lock temporarily dropped, revalidate vma
|
|
|
|
* before taking mmap_lock.
|
2016-07-27 06:26:24 +08:00
|
|
|
* Return 0 on success, otherwise return a non-zero
|
|
|
|
* value (scan code).
|
|
|
|
*/
|
|
|
|
|
2016-09-20 05:44:01 +08:00
|
|
|
static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
|
|
|
|
struct vm_area_struct **vmap)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
unsigned long hstart, hend;
|
|
|
|
|
|
|
|
if (unlikely(khugepaged_test_exit(mm)))
|
|
|
|
return SCAN_ANY_PROCESS;
|
|
|
|
|
2016-09-20 05:44:01 +08:00
|
|
|
*vmap = vma = find_vma(mm, address);
|
2016-07-27 06:26:24 +08:00
|
|
|
if (!vma)
|
|
|
|
return SCAN_VMA_NULL;
|
|
|
|
|
|
|
|
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
|
|
|
|
hend = vma->vm_end & HPAGE_PMD_MASK;
|
|
|
|
if (address < hstart || address + HPAGE_PMD_SIZE > hend)
|
|
|
|
return SCAN_ADDRESS_RANGE;
|
2018-08-18 06:47:00 +08:00
|
|
|
if (!hugepage_vma_check(vma, vma->vm_flags))
|
2016-07-27 06:26:24 +08:00
|
|
|
return SCAN_VMA_CHECK;
|
2020-07-24 12:15:34 +08:00
|
|
|
/* Anon VMA expected */
|
|
|
|
if (!vma->anon_vma || vma->vm_ops)
|
|
|
|
return SCAN_VMA_CHECK;
|
2016-07-27 06:26:24 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Bring missing pages in from swap, to complete THP collapse.
|
|
|
|
* Only done if khugepaged_scan_pmd believes it is worthwhile.
|
|
|
|
*
|
|
|
|
* Called and returns without pte mapped or spinlocks held,
|
2020-06-09 12:33:54 +08:00
|
|
|
* but with mmap_lock held to protect against vma changes.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
static bool __collapse_huge_page_swapin(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
2021-01-14 23:33:49 +08:00
|
|
|
unsigned long haddr, pmd_t *pmd,
|
2016-07-27 06:26:46 +08:00
|
|
|
int referenced)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
2018-08-24 08:01:36 +08:00
|
|
|
int swapped_in = 0;
|
|
|
|
vm_fault_t ret = 0;
|
2021-01-14 23:33:49 +08:00
|
|
|
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
|
|
|
|
|
|
|
|
for (address = haddr; address < end; address += PAGE_SIZE) {
|
|
|
|
struct vm_fault vmf = {
|
|
|
|
.vma = vma,
|
|
|
|
.address = address,
|
|
|
|
.pgoff = linear_page_index(vma, haddr),
|
|
|
|
.flags = FAULT_FLAG_ALLOW_RETRY,
|
|
|
|
.pmd = pmd,
|
|
|
|
};
|
|
|
|
|
|
|
|
vmf.pte = pte_offset_map(pmd, address);
|
2016-12-15 07:07:16 +08:00
|
|
|
vmf.orig_pte = *vmf.pte;
|
2021-01-14 23:33:49 +08:00
|
|
|
if (!is_swap_pte(vmf.orig_pte)) {
|
|
|
|
pte_unmap(vmf.pte);
|
2016-07-27 06:26:24 +08:00
|
|
|
continue;
|
2021-01-14 23:33:49 +08:00
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
swapped_in++;
|
2016-12-15 07:07:16 +08:00
|
|
|
ret = do_swap_page(&vmf);
|
2016-07-27 06:26:46 +08:00
|
|
|
|
2020-06-09 12:33:54 +08:00
|
|
|
/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
|
2016-07-27 06:26:24 +08:00
|
|
|
if (ret & VM_FAULT_RETRY) {
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_lock(mm);
|
2021-01-14 23:33:49 +08:00
|
|
|
if (hugepage_vma_revalidate(mm, haddr, &vma)) {
|
2016-07-27 06:26:43 +08:00
|
|
|
/* vma is no longer available, don't continue to swapin */
|
2016-07-27 06:26:46 +08:00
|
|
|
trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
|
2016-07-27 06:26:24 +08:00
|
|
|
return false;
|
2016-07-27 06:26:43 +08:00
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
/* check if the pmd is still valid */
|
2021-01-14 23:33:49 +08:00
|
|
|
if (mm_find_pmd(mm, haddr) != pmd) {
|
2017-05-13 06:46:38 +08:00
|
|
|
trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
|
2016-07-27 06:26:24 +08:00
|
|
|
return false;
|
2017-05-13 06:46:38 +08:00
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
if (ret & VM_FAULT_ERROR) {
|
2016-07-27 06:26:46 +08:00
|
|
|
trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
|
2016-07-27 06:26:24 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
2020-06-04 07:00:17 +08:00
|
|
|
|
|
|
|
/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
|
|
|
|
if (swapped_in)
|
|
|
|
lru_add_drain();
|
|
|
|
|
2016-07-27 06:26:46 +08:00
|
|
|
trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
|
2016-07-27 06:26:24 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void collapse_huge_page(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
struct page **hpage,
|
2020-06-04 07:00:09 +08:00
|
|
|
int node, int referenced, int unmapped)
|
2016-07-27 06:26:24 +08:00
|
|
|
{
|
2020-06-04 07:00:23 +08:00
|
|
|
LIST_HEAD(compound_pagelist);
|
2016-07-27 06:26:24 +08:00
|
|
|
pmd_t *pmd, _pmd;
|
|
|
|
pte_t *pte;
|
|
|
|
pgtable_t pgtable;
|
|
|
|
struct page *new_page;
|
|
|
|
spinlock_t *pmd_ptl, *pte_ptl;
|
|
|
|
int isolated = 0, result = 0;
|
2016-09-20 05:44:01 +08:00
|
|
|
struct vm_area_struct *vma;
|
2018-12-28 16:38:09 +08:00
|
|
|
struct mmu_notifier_range range;
|
2016-07-27 06:26:24 +08:00
|
|
|
gfp_t gfp;
|
|
|
|
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
|
|
|
|
|
|
|
/* Only allocate from the target node */
|
2017-01-11 08:57:42 +08:00
|
|
|
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
|
2016-07-27 06:26:24 +08:00
|
|
|
|
2016-07-27 06:26:26 +08:00
|
|
|
/*
|
2020-06-09 12:33:54 +08:00
|
|
|
* Before allocating the hugepage, release the mmap_lock read lock.
|
2016-07-27 06:26:26 +08:00
|
|
|
* The allocation can take potentially a long time if it involves
|
2020-06-09 12:33:54 +08:00
|
|
|
* sync compaction, and we do not need to hold the mmap_lock during
|
2016-07-27 06:26:26 +08:00
|
|
|
* that. We will recheck the vma after taking it again in write mode.
|
|
|
|
*/
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:26 +08:00
|
|
|
new_page = khugepaged_alloc_page(hpage, gfp, node);
|
2016-07-27 06:26:24 +08:00
|
|
|
if (!new_page) {
|
|
|
|
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
|
|
|
|
goto out_nolock;
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:02:24 +08:00
|
|
|
if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
|
2016-07-27 06:26:24 +08:00
|
|
|
result = SCAN_CGROUP_CHARGE_FAIL;
|
|
|
|
goto out_nolock;
|
|
|
|
}
|
2020-06-04 07:02:04 +08:00
|
|
|
count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_lock(mm);
|
2016-09-20 05:44:01 +08:00
|
|
|
result = hugepage_vma_revalidate(mm, address, &vma);
|
2016-07-27 06:26:24 +08:00
|
|
|
if (result) {
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
goto out_nolock;
|
|
|
|
}
|
|
|
|
|
|
|
|
pmd = mm_find_pmd(mm, address);
|
|
|
|
if (!pmd) {
|
|
|
|
result = SCAN_PMD_NULL;
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
goto out_nolock;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-06-09 12:33:54 +08:00
|
|
|
* __collapse_huge_page_swapin always returns with mmap_lock locked.
|
|
|
|
* If it fails, we release mmap_lock and jump out_nolock.
|
2016-07-27 06:26:24 +08:00
|
|
|
* Continuing to collapse causes inconsistency.
|
|
|
|
*/
|
2020-06-04 07:00:09 +08:00
|
|
|
if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
|
|
|
|
pmd, referenced)) {
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
goto out_nolock;
|
|
|
|
}
|
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
/*
|
|
|
|
* Prevent all access to pagetables with the exception of
|
|
|
|
* gup_fast later handled by the ptep_clear_flush and the VM
|
|
|
|
* handled by the anon_vma lock + PG_lock.
|
|
|
|
*/
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
2016-09-20 05:44:01 +08:00
|
|
|
result = hugepage_vma_revalidate(mm, address, &vma);
|
2016-07-27 06:26:24 +08:00
|
|
|
if (result)
|
2021-05-05 09:34:17 +08:00
|
|
|
goto out_up_write;
|
2016-07-27 06:26:24 +08:00
|
|
|
/* check if the pmd is still valid */
|
|
|
|
if (mm_find_pmd(mm, address) != pmd)
|
2021-05-05 09:34:17 +08:00
|
|
|
goto out_up_write;
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
anon_vma_lock_write(vma->anon_vma);
|
|
|
|
|
2019-05-14 08:20:53 +08:00
|
|
|
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
|
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table updates can happen for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of the mmu notifier API track changes to the CPU page table and take
specific action for them. The current API, however, only provides the range
of virtual addresses affected by the change, not why the change is happening.
This patchset does the initial mechanical conversion of all the places that
call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is known (most invalidations happen against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.
MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
should assume that everything in the range is going away when that event
happens. A later patch converts the mm call paths to use a more appropriate
event for each call.
This is done as 2 patches so that no call site is forgotten, especially
as it uses the following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
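The signature change described in this commit message can be illustrated with a small user-space sketch. Everything below (the `mock_*` types, the `MOCK_NOTIFY_*` constants, and the listener helper) is a hypothetical stand-in invented for illustration; none of it is the kernel's actual definition. It only shows the idea that the range now carries an event, flags, and an optional vma alongside the mm and the address bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Mock types: simplified stand-ins, NOT the kernel's definitions. */
enum mock_notify_event {
    MOCK_NOTIFY_UNMAP = 0,  /* safe default: assume the whole range goes away */
    MOCK_NOTIFY_CLEAR       /* a more specific reason for the invalidation */
};

struct mock_vma { unsigned long vm_start, vm_end; };
struct mock_mm { int id; };

struct mock_range {
    enum mock_notify_event event;  /* why the range is being invalidated */
    unsigned int flags;
    struct mock_vma *vma;          /* may be NULL when no vma is known */
    struct mock_mm *mm;
    unsigned long start, end;
};

/* Mirrors the extended argument order: range, event, flags, vma, mm, start, end. */
static void mock_range_init(struct mock_range *r, enum mock_notify_event event,
                            unsigned int flags, struct mock_vma *vma,
                            struct mock_mm *mm,
                            unsigned long start, unsigned long end)
{
    assert(start < end);
    r->event = event;
    r->flags = flags;
    r->vma = vma;
    r->mm = mm;
    r->start = start;
    r->end = end;
}

/* A hypothetical listener can now decide based on the event, not only the range. */
static int mock_listener_must_drop_everything(const struct mock_range *r)
{
    return r->event == MOCK_NOTIFY_UNMAP;
}
```

A caller that knows no vma passes NULL, as the last coccinelle rule above does; a listener that does not understand a given event can always fall back to treating it like the unmap case.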
2019-05-14 08:20:49 +08:00
|
|
|
address, address + HPAGE_PMD_SIZE);
|
2018-12-28 16:38:09 +08:00
|
|
|
mmu_notifier_invalidate_range_start(&range);
|
2019-11-06 13:16:48 +08:00
|
|
|
|
|
|
|
pte = pte_offset_map(pmd, address);
|
|
|
|
pte_ptl = pte_lockptr(mm, pmd);
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
|
|
|
|
/*
|
2022-09-08 02:01:43 +08:00
|
|
|
* This removes any huge TLB entry from the CPU so we won't allow
|
|
|
|
* huge and small TLB entries for the same virtual address to
|
|
|
|
* avoid the risk of CPU bugs in that area.
|
|
|
|
*
|
|
|
|
* Parallel fast GUP is fine since fast GUP will back off when
|
|
|
|
* it detects PMD is changed.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
|
|
|
_pmd = pmdp_collapse_flush(vma, address, pmd);
|
|
|
|
spin_unlock(pmd_ptl);
|
2018-12-28 16:38:09 +08:00
|
|
|
mmu_notifier_invalidate_range_end(&range);
|
2022-12-07 01:16:04 +08:00
|
|
|
tlb_remove_table_sync_one();
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
spin_lock(pte_ptl);
|
2020-06-04 07:00:23 +08:00
|
|
|
isolated = __collapse_huge_page_isolate(vma, address, pte,
|
|
|
|
&compound_pagelist);
|
2016-07-27 06:26:24 +08:00
|
|
|
spin_unlock(pte_ptl);
|
|
|
|
|
|
|
|
if (unlikely(!isolated)) {
|
|
|
|
pte_unmap(pte);
|
|
|
|
spin_lock(pmd_ptl);
|
|
|
|
BUG_ON(!pmd_none(*pmd));
|
|
|
|
/*
|
|
|
|
* We can only use set_pmd_at when establishing
|
|
|
|
* hugepmds and never for establishing regular pmds that
|
|
|
|
* point to regular pagetables. Use pmd_populate for that.
|
|
|
|
*/
|
|
|
|
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
|
|
|
|
spin_unlock(pmd_ptl);
|
|
|
|
anon_vma_unlock_write(vma->anon_vma);
|
|
|
|
result = SCAN_FAIL;
|
2021-05-05 09:34:17 +08:00
|
|
|
goto out_up_write;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All pages are isolated and locked so anon_vma rmap
|
|
|
|
* can't run anymore.
|
|
|
|
*/
|
|
|
|
anon_vma_unlock_write(vma->anon_vma);
|
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
|
|
|
|
&compound_pagelist);
|
2016-07-27 06:26:24 +08:00
|
|
|
pte_unmap(pte);
|
2021-05-05 09:33:40 +08:00
|
|
|
/*
|
|
|
|
* spin_lock() below is not the equivalent of smp_wmb(), but
|
|
|
|
* the smp_wmb() inside __SetPageUptodate() can be reused to
|
|
|
|
* prevent the copy_huge_page writes from becoming visible after
|
|
|
|
* the set_pmd_at() write.
|
|
|
|
*/
|
2016-07-27 06:26:24 +08:00
|
|
|
__SetPageUptodate(new_page);
|
|
|
|
pgtable = pmd_pgtable(_pmd);
|
|
|
|
|
|
|
|
_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
|
2017-11-30 01:01:01 +08:00
|
|
|
_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
spin_lock(pmd_ptl);
|
|
|
|
BUG_ON(!pmd_none(*pmd));
|
2020-06-04 07:01:57 +08:00
|
|
|
page_add_new_anon_rmap(new_page, vma, address, true);
|
2020-08-12 09:30:40 +08:00
|
|
|
lru_cache_add_inactive_or_unevictable(new_page, vma);
|
2016-07-27 06:26:24 +08:00
|
|
|
pgtable_trans_huge_deposit(mm, pmd, pgtable);
|
|
|
|
set_pmd_at(mm, address, pmd, _pmd);
|
|
|
|
update_mmu_cache_pmd(vma, address, pmd);
|
|
|
|
spin_unlock(pmd_ptl);
|
|
|
|
|
|
|
|
*hpage = NULL;
|
|
|
|
|
|
|
|
khugepaged_pages_collapsed++;
|
|
|
|
result = SCAN_SUCCEED;
|
|
|
|
out_up_write:
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2016-07-27 06:26:24 +08:00
|
|
|
out_nolock:
|
2020-06-04 07:02:04 +08:00
|
|
|
if (!IS_ERR_OR_NULL(*hpage))
|
|
|
|
mem_cgroup_uncharge(*hpage);
|
2016-07-27 06:26:24 +08:00
|
|
|
trace_mm_collapse_huge_page(mm, isolated, result);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int khugepaged_scan_pmd(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
struct page **hpage)
|
|
|
|
{
|
|
|
|
pmd_t *pmd;
|
|
|
|
pte_t *pte, *_pte;
|
2020-06-04 07:00:30 +08:00
|
|
|
int ret = 0, result = 0, referenced = 0;
|
|
|
|
int none_or_zero = 0, shared = 0;
|
2016-07-27 06:26:24 +08:00
|
|
|
struct page *page = NULL;
|
|
|
|
unsigned long _address;
|
|
|
|
spinlock_t *ptl;
|
|
|
|
int node = NUMA_NO_NODE, unmapped = 0;
|
2016-07-27 06:26:46 +08:00
|
|
|
bool writable = false;
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
|
|
|
|
|
|
|
pmd = mm_find_pmd(mm, address);
|
|
|
|
if (!pmd) {
|
|
|
|
result = SCAN_PMD_NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
|
|
|
|
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
|
|
|
|
for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
|
|
|
|
_pte++, _address += PAGE_SIZE) {
|
|
|
|
pte_t pteval = *_pte;
|
|
|
|
if (is_swap_pte(pteval)) {
|
|
|
|
if (++unmapped <= khugepaged_max_ptes_swap) {
|
2020-04-07 11:06:04 +08:00
|
|
|
/*
|
|
|
|
* Always be strict with uffd-wp
|
|
|
|
* enabled swap entries. Please see
|
|
|
|
* comment below for pte_uffd_wp().
|
|
|
|
*/
|
|
|
|
if (pte_swp_uffd_wp(pteval)) {
|
|
|
|
result = SCAN_PTE_UFFD_WP;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
result = SCAN_EXCEED_SWAP_PTE;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
|
|
|
|
if (!userfaultfd_armed(vma) &&
|
|
|
|
++none_or_zero <= khugepaged_max_ptes_none) {
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
result = SCAN_EXCEED_NONE_PTE;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
}
|
2020-04-07 11:06:04 +08:00
|
|
|
if (pte_uffd_wp(pteval)) {
|
|
|
|
/*
|
|
|
|
* Don't collapse the page if any of the small
|
|
|
|
* PTEs are armed with uffd write protection.
|
|
|
|
* Here we can also mark the new huge pmd as
|
|
|
|
* write protected if any of the small ones is
|
2020-12-16 12:47:26 +08:00
|
|
|
* marked but that could bring unknown
|
2020-04-07 11:06:04 +08:00
|
|
|
* userfault messages that fall outside of
|
|
|
|
* the registered range. So, just be simple.
|
|
|
|
*/
|
|
|
|
result = SCAN_PTE_UFFD_WP;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
if (pte_write(pteval))
|
|
|
|
writable = true;
|
|
|
|
|
|
|
|
page = vm_normal_page(vma, _address, pteval);
|
|
|
|
if (unlikely(!page)) {
|
|
|
|
result = SCAN_PAGE_NULL;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:00:30 +08:00
|
|
|
if (page_mapcount(page) > 1 &&
|
|
|
|
++shared > khugepaged_max_ptes_shared) {
|
|
|
|
result = SCAN_EXCEED_SHARED_PTE;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:00:23 +08:00
|
|
|
page = compound_head(page);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Record which node the original page is from and save this
|
|
|
|
* information to khugepaged_node_load[].
|
|
|
|
* khugepaged will allocate the hugepage from the node that has the max
|
|
|
|
* hit record.
|
|
|
|
*/
|
|
|
|
node = page_to_nid(page);
|
|
|
|
if (khugepaged_scan_abort(node)) {
|
|
|
|
result = SCAN_SCAN_ABORT;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
khugepaged_node_load[node]++;
|
|
|
|
if (!PageLRU(page)) {
|
|
|
|
result = SCAN_PAGE_LRU;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
if (PageLocked(page)) {
|
|
|
|
result = SCAN_PAGE_LOCK;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
if (!PageAnon(page)) {
|
|
|
|
result = SCAN_PAGE_ANON;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-06-04 07:00:20 +08:00
|
|
|
* Check if the page has any GUP (or other external) pins.
|
|
|
|
*
|
|
|
|
* Here the check is racy: it may see total_mapcount > refcount
|
|
|
|
* in some cases.
|
|
|
|
* For example, one process with one forked child process.
|
|
|
|
* The parent has the PMD split due to MADV_DONTNEED, then
|
|
|
|
* the child is trying unmap the whole PMD, but khugepaged
|
|
|
|
* may be scanning the parent between the child has
|
|
|
|
* PageDoubleMap flag cleared and dec the mapcount. So
|
|
|
|
* khugepaged may see total_mapcount > refcount.
|
|
|
|
*
|
|
|
|
* But such a case is ephemeral; we could always retry collapse
|
|
|
|
* later. However it may report false positive if the page
|
|
|
|
* has excessive GUP pins (i.e. 512). Anyway the same check
|
|
|
|
* will be done again later, so the risk seems low.
|
2016-07-27 06:26:24 +08:00
|
|
|
*/
|
2020-06-04 07:00:20 +08:00
|
|
|
if (!is_refcount_suitable(page)) {
|
2016-07-27 06:26:24 +08:00
|
|
|
result = SCAN_PAGE_COUNT;
|
|
|
|
goto out_unmap;
|
|
|
|
}
|
|
|
|
if (pte_young(pteval) ||
|
|
|
|
page_is_young(page) || PageReferenced(page) ||
|
|
|
|
mmu_notifier_test_young(vma->vm_mm, address))
|
2016-07-27 06:26:46 +08:00
|
|
|
referenced++;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
2020-06-04 07:00:09 +08:00
|
|
|
if (!writable) {
|
2016-07-27 06:26:24 +08:00
|
|
|
result = SCAN_PAGE_RO;
|
2020-06-04 07:00:09 +08:00
|
|
|
} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
|
|
|
|
result = SCAN_LACK_REFERENCED_PAGE;
|
|
|
|
} else {
|
|
|
|
result = SCAN_SUCCEED;
|
|
|
|
ret = 1;
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
out_unmap:
|
|
|
|
pte_unmap_unlock(pte, ptl);
|
|
|
|
if (ret) {
|
|
|
|
node = khugepaged_find_target_node();
|
2020-06-09 12:33:54 +08:00
|
|
|
/* collapse_huge_page will return with the mmap_lock released */
|
2020-06-04 07:00:09 +08:00
|
|
|
collapse_huge_page(mm, address, hpage, node,
|
|
|
|
referenced, unmapped);
|
2016-07-27 06:26:24 +08:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
|
|
|
|
none_or_zero, result, unmapped);
|
|
|
|
return ret;
|
|
|
|
}
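The final decision at the end of khugepaged_scan_pmd() can be read in isolation as a pure function of three counters. The sketch below restates only that last if/else chain in user space; the `sketch_*` names and the value 512 for the number of PTEs per PMD are assumptions for illustration (512 corresponds to 4 KiB pages under a 2 MiB PMD), and all the per-PTE checks that precede the decision in the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB base pages and a 2 MiB PMD, so 512 PTEs per PMD. */
#define SKETCH_HPAGE_PMD_NR 512

enum sketch_result {
    SKETCH_SUCCEED,
    SKETCH_PAGE_RO,
    SKETCH_LACK_REFERENCED_PAGE
};

/*
 * Restates the final verdict of the scan:
 *  - no writable PTE was seen          -> read-only, do not collapse
 *  - nothing referenced, or swap PTEs
 *    present with fewer than half of
 *    the PTEs referenced               -> not hot enough to collapse
 *  - otherwise                         -> candidate for collapse
 */
static enum sketch_result sketch_scan_verdict(bool writable, int referenced,
                                              int unmapped)
{
    assert(referenced >= 0 && unmapped >= 0);
    if (!writable)
        return SKETCH_PAGE_RO;
    if (!referenced || (unmapped && referenced < SKETCH_HPAGE_PMD_NR / 2))
        return SKETCH_LACK_REFERENCED_PAGE;
    return SKETCH_SUCCEED;
}
```

With no swapped-out PTEs a single referenced page is enough, whereas a range that needs swap-in must have at least half of its PTEs referenced before the collapse is attempted.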
|
|
|
|
|
|
|
|
static void collect_mm_slot(struct mm_slot *mm_slot)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = mm_slot->mm;
|
|
|
|
|
2018-10-05 14:45:47 +08:00
|
|
|
lockdep_assert_held(&khugepaged_mm_lock);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
if (khugepaged_test_exit(mm)) {
|
|
|
|
/* free mm_slot */
|
|
|
|
hash_del(&mm_slot->hash);
|
|
|
|
list_del(&mm_slot->mm_node);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not strictly needed because the mm exited already.
|
|
|
|
*
|
|
|
|
* clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* khugepaged_mm_lock actually not necessary for the below */
|
|
|
|
free_mm_slot(mm_slot);
|
|
|
|
mmdrop(mm);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-04-07 11:04:35 +08:00
|
|
|
#ifdef CONFIG_SHMEM
|
2019-09-24 06:38:30 +08:00
|
|
|
/*
|
|
|
|
* Notify khugepaged that given addr of the mm is pte-mapped THP. Then
|
|
|
|
* khugepaged should try to collapse the page table.
|
|
|
|
*/
|
|
|
|
static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
|
|
|
|
unsigned long addr)
|
|
|
|
{
|
|
|
|
struct mm_slot *mm_slot;
|
|
|
|
|
|
|
|
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
|
|
|
|
|
|
|
|
spin_lock(&khugepaged_mm_lock);
|
|
|
|
mm_slot = get_mm_slot(mm);
|
|
|
|
if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
|
|
|
|
mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
|
|
|
|
spin_unlock(&khugepaged_mm_lock);
|
|
|
|
return 0;
|
|
|
|
}
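khugepaged_add_pte_mapped_thp() above records an address in a fixed-size per-mm array and silently drops it when the array is full. A minimal user-space sketch of that bounded-append pattern follows. The `sketch_*` names, the capacity of 8, and the 2 MiB huge page size are assumptions for illustration; the khugepaged_mm_lock spinlock is omitted, and unlike the function above, the sketch returns whether the address was recorded so that the behavior can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration: 2 MiB huge pages, capacity of 8 entries. */
#define SKETCH_HPAGE_PMD_SIZE     (1UL << 21)
#define SKETCH_HPAGE_PMD_MASK     (~(SKETCH_HPAGE_PMD_SIZE - 1))
#define SKETCH_MAX_PTE_MAPPED_THP 8

struct sketch_slot {
    int nr;                                          /* entries in use */
    unsigned long addrs[SKETCH_MAX_PTE_MAPPED_THP];  /* huge-page-aligned */
};

/*
 * Append addr if there is a slot and it has room; otherwise drop it.
 * Returns 1 if recorded, 0 if dropped. Locking is omitted in this sketch.
 */
static int sketch_add_pte_mapped_thp(struct sketch_slot *slot,
                                     unsigned long addr)
{
    /* Same precondition as the VM_BUG_ON above: addr must be aligned. */
    assert((addr & ~SKETCH_HPAGE_PMD_MASK) == 0);

    if (slot && slot->nr < SKETCH_MAX_PTE_MAPPED_THP) {
        slot->addrs[slot->nr++] = addr;
        return 1;
    }
    return 0;
}
```

Dropping on overflow is tolerable in this pattern because the recorded addresses are only hints for later work; a missed hint costs an optimization, not correctness.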
|
|
|
|
|
|
|
|
/**
|
2020-12-15 11:12:01 +08:00
|
|
|
* collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
|
|
|
|
* address haddr.
|
|
|
|
*
|
|
|
|
* @mm: process address space where collapse happens
|
|
|
|
* @addr: THP collapse address
|
2019-09-24 06:38:30 +08:00
|
|
|
*
|
|
|
|
* This function checks whether all the PTEs in the PMD are pointing to the
|
|
|
|
* right THP. If so, retract the page table so the THP can refault in
|
|
|
|
* as pmd-mapped.
|
|
|
|
*/
|
|
|
|
void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
|
|
|
|
{
|
|
|
|
unsigned long haddr = addr & HPAGE_PMD_MASK;
|
|
|
|
struct vm_area_struct *vma = find_vma(mm, haddr);
|
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
When retract_page_tables() removes a page table to make way for a huge
pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the
case when the original mmap_write_trylock had failed), only
mmap_write_trylock and pmd lock are held.
That's not enough. One machine has twice crashed under load, with "BUG:
spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second
crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving
page_referenced() on a file THP, that had found a page table at *pmd)
discovers that the page table page and its lock have already been freed by
the time it comes to unlock.
Follow the example of retract_page_tables(), but we only need one of huge
page lock or i_mmap_lock_write to secure against this: because it's the
narrower lock, and because it simplifies collapse_pte_mapped_thp() to know
the hpage earlier, choose to rely on huge page lock here.
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org> [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 14:26:18 +08:00
|
|
|
struct page *hpage;
|
2019-09-24 06:38:30 +08:00
|
|
|
pte_t *start_pte, *pte;
|
|
|
|
pmd_t *pmd, _pmd;
|
|
|
|
spinlock_t *ptl;
|
|
|
|
int count = 0;
|
|
|
|
int i;
|
2022-12-07 01:16:05 +08:00
|
|
|
struct mmu_notifier_range range;
|
2019-09-24 06:38:30 +08:00
|
|
|
|
|
|
|
if (!vma || !vma->vm_file ||
|
2021-05-05 09:34:15 +08:00
|
|
|
!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
|
2019-09-24 06:38:30 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This vm_flags may not have VM_HUGEPAGE if the page was not
|
|
|
|
* collapsed by this mm. But we can still collapse if the page is
|
|
|
|
* the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
|
|
|
|
* will not fail the vma for missing VM_HUGEPAGE
|
|
|
|
*/
|
|
|
|
if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
|
|
|
|
return;
|
|
|
|
|
2020-08-07 14:26:18 +08:00
|
|
|
hpage = find_lock_page(vma->vm_file->f_mapping,
|
|
|
|
linear_page_index(vma, haddr));
|
|
|
|
if (!hpage)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!PageHead(hpage))
|
|
|
|
goto drop_hpage;
|
|
|
|
|
2019-09-24 06:38:30 +08:00
|
|
|
pmd = mm_find_pmd(mm, haddr);
|
|
|
|
if (!pmd)
|
2020-08-07 14:26:18 +08:00
|
|
|
goto drop_hpage;
|
2019-09-24 06:38:30 +08:00
|
|
|
|
mm/khugepaged: take the right locks for page table retraction
commit 8d3c106e19e8d251da31ff4cc7462e4565d65084 upstream.
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[manual backport: this code was refactored from two copies into a common
helper between 5.15 and 6.0]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
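The locking rule stated in this commit message (a page walk may start under any one of three locks held in shared mode, so retracting a page table needs all three in exclusive mode) can be modeled with a single-threaded toy. The counter-based `sketch_rwlock` below is not a real synchronization primitive, the `sketch_*` names are hypothetical, and the acquisition order and the use of try-locks are simplifications chosen for the sketch rather than what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy reader/writer lock: a counter model, not a real lock. */
struct sketch_rwlock {
    int readers;
    bool writer;
};

static bool sketch_try_read(struct sketch_rwlock *l)
{
    if (l->writer)
        return false;
    l->readers++;
    return true;
}

static void sketch_unlock_read(struct sketch_rwlock *l)
{
    assert(l->readers > 0);
    l->readers--;
}

static bool sketch_try_write(struct sketch_rwlock *l)
{
    if (l->writer || l->readers > 0)
        return false;
    l->writer = true;
    return true;
}

static void sketch_unlock_write(struct sketch_rwlock *l)
{
    assert(l->writer);
    l->writer = false;
}

/* The three locks under which a page walk may start, per the text above. */
struct sketch_locks {
    struct sketch_rwlock mmap;
    struct sketch_rwlock anon_vma;
    struct sketch_rwlock mapping;
};

/* Retraction is allowed only if ALL three locks can be held exclusively. */
static bool sketch_try_begin_retract(struct sketch_locks *k)
{
    struct sketch_rwlock *order[3] = { &k->mmap, &k->anon_vma, &k->mapping };
    size_t i;

    for (i = 0; i < 3; i++) {
        if (!sketch_try_write(order[i])) {
            /* Roll back the locks already taken, newest first. */
            while (i > 0)
                sketch_unlock_write(order[--i]);
            return false;
        }
    }
    return true;
}

static void sketch_end_retract(struct sketch_locks *k)
{
    sketch_unlock_write(&k->mapping);
    sketch_unlock_write(&k->anon_vma);
    sketch_unlock_write(&k->mmap);
}
```

In this model a single walker holding only the mapping lock in shared mode is enough to block retraction, which is the situation a page-table-level spinlock alone cannot protect against.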
2022-12-07 01:16:06 +08:00
|
|
|
/*
|
|
|
|
* We need to lock the mapping so that from here on, only GUP-fast and
|
|
|
|
* hardware page walks can access the parts of the page tables that
|
|
|
|
* we're operating on.
|
|
|
|
*/
|
|
|
|
i_mmap_lock_write(vma->vm_file->f_mapping);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This spinlock should be unnecessary: Nobody else should be accessing
|
|
|
|
* the page tables under spinlock protection here, only
|
|
|
|
* lockless_pages_from_mm() and the hardware page walker can access page
|
|
|
|
* tables while all the high-level locks are held in write mode.
|
|
|
|
*/
|
2019-09-24 06:38:30 +08:00
|
|
|
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
|
|
|
|
|
|
|
|
/* step 1: check all mapped PTEs are to the right huge page */
|
|
|
|
for (i = 0, addr = haddr, pte = start_pte;
|
|
|
|
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
/* empty pte, skip */
|
|
|
|
if (pte_none(*pte))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* page swapped out, abort */
|
|
|
|
if (!pte_present(*pte))
|
|
|
|
goto abort;
|
|
|
|
|
|
|
|
page = vm_normal_page(vma, addr, *pte);
|
|
|
|
|
|
|
|
/*
|
2020-08-07 14:26:18 +08:00
|
|
|
* Note that uprobe, debugger, or MAP_PRIVATE may change the
|
|
|
|
* page table, but the new page will not be a subpage of hpage.
|
2019-09-24 06:38:30 +08:00
|
|
|
*/
|
2020-08-07 14:26:18 +08:00
|
|
|
if (hpage + i != page)
|
2019-09-24 06:38:30 +08:00
|
|
|
goto abort;
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* step 2: adjust rmap */
|
|
|
|
for (i = 0, addr = haddr, pte = start_pte;
|
|
|
|
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
if (pte_none(*pte))
|
|
|
|
continue;
|
|
|
|
page = vm_normal_page(vma, addr, *pte);
|
|
|
|
page_remove_rmap(page, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
pte_unmap_unlock(start_pte, ptl);
|
|
|
|
|
|
|
|
/* step 3: set proper refcount and mm_counters. */
|
2020-08-07 14:26:18 +08:00
|
|
|
if (count) {
|
2019-09-24 06:38:30 +08:00
|
|
|
page_ref_sub(hpage, count);
|
|
|
|
add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* step 4: collapse pmd */
|
2022-12-23 04:41:50 +08:00
|
|
|
/* we make no change to anon, but protect concurrent anon page lookup */
|
|
|
|
if (vma->anon_vma)
|
|
|
|
anon_vma_lock_write(vma->anon_vma);
|
|
|
|
|
2022-12-07 01:16:05 +08:00
|
|
|
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, haddr,
|
|
|
|
haddr + HPAGE_PMD_SIZE);
|
|
|
|
mmu_notifier_invalidate_range_start(&range);
|
2020-08-07 14:26:15 +08:00
|
|
|
_pmd = pmdp_collapse_flush(vma, haddr, pmd);
|
2019-09-24 06:38:30 +08:00
|
|
|
mm_dec_nr_ptes(mm);
|
2022-12-07 01:16:04 +08:00
|
|
|
tlb_remove_table_sync_one();
|
2022-12-07 01:16:05 +08:00
|
|
|
mmu_notifier_invalidate_range_end(&range);
|
2019-09-24 06:38:30 +08:00
|
|
|
pte_free(mm, pmd_pgtable(_pmd));
|
2020-08-07 14:26:18 +08:00
|
|
|
|
2022-12-23 04:41:50 +08:00
|
|
|
if (vma->anon_vma)
|
|
|
|
anon_vma_unlock_write(vma->anon_vma);
|
mm/khugepaged: take the right locks for page table retraction
commit 8d3c106e19e8d251da31ff4cc7462e4565d65084 upstream.
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[manual backport: this code was refactored from two copies into a common
helper between 5.15 and 6.0]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-07 01:16:06 +08:00
|
|
|
i_mmap_unlock_write(vma->vm_file->f_mapping);
|
|
|
|
|
2020-08-07 14:26:18 +08:00
|
|
|
drop_hpage:
|
|
|
|
unlock_page(hpage);
|
|
|
|
put_page(hpage);
|
2019-09-24 06:38:30 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
abort:
|
|
|
|
pte_unmap_unlock(start_pte, ptl);
|
2022-12-07 01:16:06 +08:00
|
|
|
i_mmap_unlock_write(vma->vm_file->f_mapping);
|
2020-08-07 14:26:18 +08:00
|
|
|
goto drop_hpage;
|
2019-09-24 06:38:30 +08:00
|
|
|
}
|
|
|
|
|
2021-05-05 09:33:37 +08:00
|
|
|
static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
|
2019-09-24 06:38:30 +08:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = mm_slot->mm;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (likely(mm_slot->nr_pte_mapped_thp == 0))
|
2021-05-05 09:33:37 +08:00
|
|
|
return;
|
2019-09-24 06:38:30 +08:00
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
if (!mmap_write_trylock(mm))
|
2021-05-05 09:33:37 +08:00
|
|
|
return;
|
2019-09-24 06:38:30 +08:00
|
|
|
|
|
|
|
if (unlikely(khugepaged_test_exit(mm)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
|
|
|
|
collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
|
|
|
|
|
|
|
|
out:
|
|
|
|
mm_slot->nr_pte_mapped_thp = 0;
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2019-09-24 06:38:30 +08:00
|
|
|
}
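The function above only works through its deferred list when mmap_write_trylock() succeeds, and otherwise leaves the work for a later pass. A minimal userspace sketch of that "trylock or defer" pattern follows. Every name in it (sketch_mm, sketch_trylock, sketch_retract_or_defer, sketch_drain_deferred, SKETCH_MAX_DEFERRED) is hypothetical, an atomic_flag stands in for the mmap lock, and a counter stands in for the real page table work; it is not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DEFERRED 8   /* hypothetical capacity of the deferred list */

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_slot. */
struct sketch_mm {
    atomic_flag mmap_lock;                        /* stands in for the mmap lock */
    unsigned long deferred[SKETCH_MAX_DEFERRED];  /* addresses to retry later */
    size_t nr_deferred;
    size_t nr_collapsed;                          /* counts work actually done */
};

/* Returns true if the lock was free and is now held by the caller. */
static bool sketch_trylock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void sketch_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Called with another lock already held (the page lock in the original).
 * Blocking on mmap_lock here could deadlock against a path that takes the
 * two locks in the opposite order, so only try it and defer on failure.
 */
static bool sketch_retract_or_defer(struct sketch_mm *mm, unsigned long addr)
{
    if (sketch_trylock(&mm->mmap_lock)) {
        mm->nr_collapsed++;            /* placeholder for the real work */
        sketch_unlock(&mm->mmap_lock);
        return true;
    }
    if (mm->nr_deferred < SKETCH_MAX_DEFERRED)
        mm->deferred[mm->nr_deferred++] = addr;
    return false;
}

/* Later pass: handle deferred addresses, again only if the lock is free. */
static size_t sketch_drain_deferred(struct sketch_mm *mm)
{
    size_t handled;

    if (mm->nr_deferred == 0)
        return 0;
    if (!sketch_trylock(&mm->mmap_lock))
        return 0;                      /* still contended: try again later */
    handled = mm->nr_deferred;
    mm->nr_collapsed += handled;
    mm->nr_deferred = 0;
    sketch_unlock(&mm->mmap_lock);
    return handled;
}
```

The sketch never sleeps on the second lock. A failed trylock costs only a retry, whereas a blocking acquire in the inverted order could deadlock.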
|
|
|
|
|
2016-07-27 06:26:32 +08:00
|
|
|
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
|
|
|
|
{
|
|
|
|
struct vm_area_struct *vma;
|
khugepaged: retract_page_tables() remember to test exit
Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.
The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.
In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example. But retract_page_tables() did not: fix that.
The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org> [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 14:26:22 +08:00
|
|
|
struct mm_struct *mm;
|
2016-07-27 06:26:32 +08:00
|
|
|
unsigned long addr;
|
|
|
|
pmd_t *pmd, _pmd;
|
|
|
|
|
|
|
|
i_mmap_lock_write(mapping);
|
|
|
|
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
|
2019-09-24 06:38:30 +08:00
|
|
|
/*
|
|
|
|
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
|
|
|
|
* got written to. These VMAs are likely not worth investing
|
2020-06-09 12:33:51 +08:00
|
|
|
* mmap_write_lock(mm) as PMD-mapping is likely to be split
|
2019-09-24 06:38:30 +08:00
|
|
|
* later.
|
|
|
|
*
|
|
|
|
* Note that the vma->anon_vma check is racy: it can be set up after
|
2020-06-09 12:33:54 +08:00
|
|
|
* the check but before we took mmap_lock by the fault path.
|
2019-09-24 06:38:30 +08:00
|
|
|
* But page lock would prevent establishing any new ptes of the
|
|
|
|
* page, so we are safe.
|
|
|
|
*
|
|
|
|
* An alternative would be to drop the check, but check that page
|
|
|
|
* table is clear before calling pmdp_collapse_flush() under
|
|
|
|
* ptl. It has higher chance to recover THP for the VMA, but
|
2022-12-07 01:16:06 +08:00
|
|
|
* has higher cost too. It would also probably require locking
|
|
|
|
* the anon_vma.
|
2019-09-24 06:38:30 +08:00
|
|
|
*/
|
2016-07-27 06:26:32 +08:00
|
|
|
if (vma->anon_vma)
|
|
|
|
continue;
|
|
|
|
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
|
|
|
|
if (addr & ~HPAGE_PMD_MASK)
|
|
|
|
continue;
|
|
|
|
if (vma->vm_end < addr + HPAGE_PMD_SIZE)
|
|
|
|
continue;
|
2020-08-07 14:26:22 +08:00
|
|
|
mm = vma->vm_mm;
|
|
|
|
pmd = mm_find_pmd(mm, addr);
|
2016-07-27 06:26:32 +08:00
|
|
|
if (!pmd)
|
|
|
|
continue;
|
|
|
|
/*
|
2020-06-09 12:33:54 +08:00
|
|
|
* We need exclusive mmap_lock to retract page table.
|
2019-09-24 06:38:30 +08:00
|
|
|
*
|
|
|
|
* We use trylock due to lock inversion: we need to acquire
|
2020-06-09 12:33:54 +08:00
|
|
|
* mmap_lock while holding page lock. Fault path does it in
|
2019-09-24 06:38:30 +08:00
|
|
|
* reverse order. Trylock is a way to avoid deadlock.
|
2016-07-27 06:26:32 +08:00
|
|
|
*/
|
2020-08-07 14:26:22 +08:00
|
|
|
if (mmap_write_trylock(mm)) {
|
|
|
|
if (!khugepaged_test_exit(mm)) {
|
2022-12-07 01:16:05 +08:00
|
|
|
struct mmu_notifier_range range;
|
|
|
|
|
|
|
|
mmu_notifier_range_init(&range,
|
|
|
|
MMU_NOTIFY_CLEAR, 0,
|
|
|
|
NULL, mm, addr,
|
|
|
|
addr + HPAGE_PMD_SIZE);
|
|
|
|
mmu_notifier_invalidate_range_start(&range);
|
2020-08-07 14:26:22 +08:00
|
|
|
/* assume page table is clear */
|
|
|
|
_pmd = pmdp_collapse_flush(vma, addr, pmd);
|
|
|
|
mm_dec_nr_ptes(mm);
|
2022-12-07 01:16:04 +08:00
|
|
|
tlb_remove_table_sync_one();
|
2020-08-07 14:26:22 +08:00
|
|
|
pte_free(mm, pmd_pgtable(_pmd));
|
2022-12-07 01:16:05 +08:00
|
|
|
mmu_notifier_invalidate_range_end(&range);
|
2020-08-07 14:26:22 +08:00
|
|
|
}
|
|
|
|
mmap_write_unlock(mm);
|
2019-09-24 06:38:30 +08:00
|
|
|
} else {
|
|
|
|
/* Try again later */
|
2020-08-07 14:26:22 +08:00
|
|
|
khugepaged_add_pte_mapped_thp(mm, addr);
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
i_mmap_unlock_write(mapping);
|
|
|
|
}
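The address arithmetic in retract_page_tables() can be tried in isolation. The sketch below is a userspace approximation that assumes 4 KiB base pages and 2 MiB PMD-sized huge pages; the SKETCH_* constants, struct sketch_vma and sketch_retract_candidate() are hypothetical names, not kernel API. It mirrors the two `continue` checks: the computed address must be huge-page aligned, and the VMA must cover the whole huge page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes: 4 KiB base pages, 2 MiB PMD-sized huge pages. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_HPAGE_PMD_SIZE (UINT64_C(1) << 21)
#define SKETCH_HPAGE_PMD_MASK (~(SKETCH_HPAGE_PMD_SIZE - 1))

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct sketch_vma {
    uint64_t vm_start;  /* first mapped address */
    uint64_t vm_end;    /* one past the last mapped address */
    uint64_t vm_pgoff;  /* file offset of vm_start, in pages */
};

/*
 * Translate file page offset pgoff into a virtual address within vma and
 * report whether a whole, aligned huge page fits there. On success the
 * address is stored in *addr_out.
 */
static bool sketch_retract_candidate(const struct sketch_vma *vma,
                                     uint64_t pgoff, uint64_t *addr_out)
{
    uint64_t addr;

    /* The interval tree walk only yields VMAs that map pgoff. */
    assert(pgoff >= vma->vm_pgoff);

    addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << SKETCH_PAGE_SHIFT);
    if (addr & ~SKETCH_HPAGE_PMD_MASK)
        return false;   /* not aligned to a huge page boundary */
    if (vma->vm_end < addr + SKETCH_HPAGE_PMD_SIZE)
        return false;   /* VMA ends before the huge page does */

    *addr_out = addr;
    return true;
}
```

For a VMA spanning 0x200000 to 0x600000 with vm_pgoff 0, page offsets 0 and 512 qualify, while offset 1 fails the alignment check.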
|
|
|
|
|
|
|
|
/**
|
2019-09-24 06:38:00 +08:00
|
|
|
* collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
|
2016-07-27 06:26:32 +08:00
|
|
|
*
|
2020-12-15 11:12:01 +08:00
|
|
|
* @mm: process address space where collapse happens
|
|
|
|
* @file: file to collapse on
|
|
|
|
* @start: collapse start address
|
|
|
|
* @hpage: newly allocated huge page for collapse
|
|
|
|
* @node: appointed node the new huge page is allocated from
|
|
|
|
*
|
2016-07-27 06:26:32 +08:00
|
|
|
* Basic scheme is simple, details are more complex:
|
mm/khugepaged: collapse_shmem() without freezing new_page
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.
Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.
One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.
The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.
Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-01 06:10:43 +08:00
|
|
|
* - allocate and lock a new huge page;
|
2017-12-05 03:56:08 +08:00
|
|
|
* - scan page cache replacing old pages with the new one
|
2019-09-24 06:38:00 +08:00
|
|
|
* + swap/gup in pages if necessary;
|
2016-07-27 06:26:32 +08:00
|
|
|
* + fill in gaps;
|
2017-12-05 03:56:08 +08:00
|
|
|
* + keep old pages around in case rollback is required;
|
|
|
|
* - if replacing succeeds:
|
2016-07-27 06:26:32 +08:00
|
|
|
* + copy data over;
|
|
|
|
* + free old pages;
|
2018-12-01 06:10:43 +08:00
|
|
|
* + unlock huge page;
|
2016-07-27 06:26:32 +08:00
|
|
|
* - if replacing fails:
|
|
|
|
* + put all pages back and unfreeze them;
|
2017-12-05 03:56:08 +08:00
|
|
|
* + restore gaps in the page cache;
|
2018-12-01 06:10:43 +08:00
|
|
|
* + unlock and free huge page;
|
2016-07-27 06:26:32 +08:00
|
|
|
*/
|
2019-09-24 06:37:57 +08:00
|
|
|
static void collapse_file(struct mm_struct *mm,
|
|
|
|
struct file *file, pgoff_t start,
|
2016-07-27 06:26:32 +08:00
|
|
|
struct page **hpage, int node)
|
|
|
|
{
|
2019-09-24 06:37:57 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2016-07-27 06:26:32 +08:00
|
|
|
gfp_t gfp;
|
2017-12-05 03:56:08 +08:00
|
|
|
struct page *new_page;
|
2016-07-27 06:26:32 +08:00
|
|
|
pgoff_t index, end = start + HPAGE_PMD_NR;
|
|
|
|
LIST_HEAD(pagelist);
|
2017-12-05 03:56:08 +08:00
|
|
|
XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
|
2016-07-27 06:26:32 +08:00
|
|
|
int nr_none = 0, result = SCAN_SUCCEED;
|
2019-09-24 06:38:00 +08:00
|
|
|
bool is_shmem = shmem_file(file);
|
2021-02-25 04:03:27 +08:00
|
|
|
int nr;
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2019-09-24 06:38:00 +08:00
|
|
|
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
|
2016-07-27 06:26:32 +08:00
|
|
|
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
|
|
|
|
|
|
|
|
/* Only allocate from the target node */
|
2017-01-11 08:57:42 +08:00
|
|
|
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
|
2016-07-27 06:26:32 +08:00
|
|
|
|
|
|
|
new_page = khugepaged_alloc_page(hpage, gfp, node);
|
|
|
|
if (!new_page) {
|
|
|
|
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2020-06-04 07:02:24 +08:00
|
|
|
if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
|
2016-07-27 06:26:32 +08:00
|
|
|
result = SCAN_CGROUP_CHARGE_FAIL;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-06-04 07:02:04 +08:00
|
|
|
count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2018-12-01 06:10:50 +08:00
|
|
|
/* This will be less messy when we use multi-index entries */
|
|
|
|
do {
|
|
|
|
xas_lock_irq(&xas);
|
|
|
|
xas_create_range(&xas);
|
|
|
|
if (!xas_error(&xas))
|
|
|
|
break;
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
if (!xas_nomem(&xas, GFP_KERNEL)) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
} while (1);
|
|
|
|
|
2018-12-01 06:10:39 +08:00
|
|
|
__SetPageLocked(new_page);
|
2019-09-24 06:38:00 +08:00
|
|
|
if (is_shmem)
|
|
|
|
__SetPageSwapBacked(new_page);
|
2016-07-27 06:26:32 +08:00
|
|
|
new_page->index = start;
|
|
|
|
new_page->mapping = mapping;
|
|
|
|
|
|
|
|
/*
|
2018-12-01 06:10:43 +08:00
|
|
|
* At this point the new_page is locked and not up-to-date.
|
|
|
|
* It's safe to insert it into the page cache, because nobody would
|
|
|
|
* be able to map it or use it in another way until we unlock it.
|
2016-07-27 06:26:32 +08:00
|
|
|
*/
|
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_set(&xas, start);
|
|
|
|
for (index = start; index < end; index++) {
|
|
|
|
struct page *page = xas_next(&xas);
|
|
|
|
|
|
|
|
VM_BUG_ON(index != xas.xa_index);
|
2019-09-24 06:38:00 +08:00
|
|
|
if (is_shmem) {
|
|
|
|
if (!page) {
|
|
|
|
/*
|
|
|
|
* Stop if extent has been truncated or
|
|
|
|
* hole-punched, and is now completely
|
|
|
|
* empty.
|
|
|
|
*/
|
|
|
|
if (index == start) {
|
|
|
|
if (!xas_next_entry(&xas, end - 1)) {
|
|
|
|
result = SCAN_TRUNCATED;
|
|
|
|
goto xa_locked;
|
|
|
|
}
|
|
|
|
xas_set(&xas, index);
|
|
|
|
}
|
|
|
|
if (!shmem_charge(mapping->host, 1)) {
|
|
|
|
result = SCAN_FAIL;
|
2018-12-01 06:10:39 +08:00
|
|
|
goto xa_locked;
|
2018-12-01 06:10:25 +08:00
|
|
|
}
|
2019-09-24 06:38:00 +08:00
|
|
|
xas_store(&xas, new_page);
|
|
|
|
nr_none++;
|
|
|
|
continue;
|
2018-12-01 06:10:25 +08:00
|
|
|
}
|
2019-09-24 06:38:00 +08:00
|
|
|
|
|
|
|
if (xa_is_value(page) || !PageUptodate(page)) {
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
/* swap in or instantiate fallocated page */
|
|
|
|
if (shmem_getpage(mapping->host, index, &page,
|
2021-09-03 05:54:34 +08:00
|
|
|
SGP_NOALLOC)) {
|
2019-09-24 06:38:00 +08:00
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto xa_unlocked;
|
|
|
|
}
|
|
|
|
} else if (trylock_page(page)) {
|
|
|
|
get_page(page);
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
} else {
|
|
|
|
result = SCAN_PAGE_LOCK;
|
2018-12-01 06:10:39 +08:00
|
|
|
goto xa_locked;
|
2017-12-05 03:56:08 +08:00
|
|
|
}
|
2019-09-24 06:38:00 +08:00
|
|
|
} else { /* !is_shmem */
|
|
|
|
if (!page || xa_is_value(page)) {
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
page_cache_sync_readahead(mapping, &file->f_ra,
|
|
|
|
file, index,
|
2020-09-05 07:36:16 +08:00
|
|
|
end - index);
|
2019-09-24 06:38:00 +08:00
|
|
|
/* drain pagevecs to help isolate_lru_page() */
|
|
|
|
lru_add_drain();
|
|
|
|
page = find_lock_page(mapping, index);
|
|
|
|
if (unlikely(page == NULL)) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto xa_unlocked;
|
|
|
|
}
|
2019-12-01 09:57:19 +08:00
|
|
|
} else if (PageDirty(page)) {
|
|
|
|
/*
|
|
|
|
* khugepaged only works on read-only fd,
|
|
|
|
* so this page is dirty because it hasn't
|
|
|
|
* been flushed since first write. There
|
|
|
|
* won't be new dirty pages.
|
|
|
|
*
|
|
|
|
* Trigger async flush here and hope the
|
|
|
|
* writeback is done when khugepaged
|
|
|
|
* revisits this page.
|
|
|
|
*
|
|
|
|
* This is a one-off situation. We are not
|
|
|
|
* forcing writeback in loop.
|
|
|
|
*/
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
filemap_flush(mapping);
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto xa_unlocked;
|
2021-10-29 05:36:27 +08:00
|
|
|
} else if (PageWriteback(page)) {
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto xa_unlocked;
|
2019-09-24 06:38:00 +08:00
|
|
|
} else if (trylock_page(page)) {
|
|
|
|
get_page(page);
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
} else {
|
|
|
|
result = SCAN_PAGE_LOCK;
|
|
|
|
goto xa_locked;
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-04-11 07:36:56 +08:00
|
|
|
* The page must be locked, so we can drop the i_pages lock
|
2016-07-27 06:26:32 +08:00
|
|
|
* without racing with truncate.
|
|
|
|
*/
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
mm,thp: recheck each page before collapsing file THP
In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date. Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.
With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data. This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date. This will NOT corrupt data in the disk,
because the page is read-only and never written back. Fix this by
properly checking PageUptodate() after locking the page. This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
Also, move PageDirty() check after locking the page. Current khugepaged
should not try to collapse dirty file THP, because it is limited to
read-only .text. The only case we hit a dirty page here is when the
page hasn't been written back since the first write. Bail out and retry when this
happens.
syzbot reported bug on previous version of this patch.
Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16 09:34:53 +08:00
|
|
|
|
|
|
|
/* make sure the page is up to date */
|
|
|
|
if (unlikely(!PageUptodate(page))) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2018-12-01 06:10:47 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If file was truncated then extended, or hole-punched, before
|
|
|
|
* we locked the first page, then a THP might be there already.
|
|
|
|
*/
|
|
|
|
if (PageTransCompound(page)) {
|
|
|
|
result = SCAN_PAGE_COMPOUND;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2016-07-27 06:26:32 +08:00
|
|
|
|
|
|
|
if (page_mapping(page) != mapping) {
|
|
|
|
result = SCAN_TRUNCATED;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2021-10-29 05:36:27 +08:00
|
|
|
if (!is_shmem && (PageDirty(page) ||
|
|
|
|
PageWriteback(page))) {
|
2019-11-16 09:34:53 +08:00
|
|
|
/*
|
|
|
|
* khugepaged only works on read-only fd, so this
|
|
|
|
* page is dirty because it hasn't been flushed
|
|
|
|
* since first write.
|
|
|
|
*/
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2016-07-27 06:26:32 +08:00
|
|
|
if (isolate_lru_page(page)) {
|
|
|
|
result = SCAN_DEL_PAGE_LRU;
|
2018-12-01 06:10:39 +08:00
|
|
|
goto out_unlock;
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
|
2019-09-24 06:38:00 +08:00
|
|
|
if (page_has_private(page) &&
|
|
|
|
!try_to_release_page(page, GFP_KERNEL)) {
|
|
|
|
result = SCAN_PAGE_HAS_PRIVATE;
|
2020-05-28 13:20:43 +08:00
|
|
|
putback_lru_page(page);
|
2019-09-24 06:38:00 +08:00
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2016-07-27 06:26:32 +08:00
|
|
|
if (page_mapped(page))
|
2018-02-01 08:17:36 +08:00
|
|
|
unmap_mapping_pages(mapping, index, 1, false);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_lock_irq(&xas);
|
|
|
|
xas_set(&xas, index);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
VM_BUG_ON_PAGE(page != xas_load(&xas), page);
|
2016-07-27 06:26:32 +08:00
|
|
|
VM_BUG_ON_PAGE(page_mapped(page), page);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The page is expected to have page_count() == 3:
|
|
|
|
* - we hold a pin on it;
|
2017-12-05 03:56:08 +08:00
|
|
|
* - one reference from page cache;
|
2016-07-27 06:26:32 +08:00
|
|
|
* - one from isolate_lru_page;
|
|
|
|
*/
|
|
|
|
if (!page_ref_freeze(page, 3)) {
|
|
|
|
result = SCAN_PAGE_COUNT;
|
2018-12-01 06:10:39 +08:00
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
putback_lru_page(page);
|
|
|
|
goto out_unlock;
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the page to the list to be able to undo the collapse if
|
|
|
|
* something goes wrong.
|
|
|
|
*/
|
|
|
|
list_add_tail(&page->lru, &pagelist);
|
|
|
|
|
|
|
|
/* Finally, replace with the new page. */
|
2019-09-24 06:34:52 +08:00
|
|
|
xas_store(&xas, new_page);
|
2016-07-27 06:26:32 +08:00
|
|
|
continue;
|
|
|
|
out_unlock:
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
2018-12-01 06:10:39 +08:00
|
|
|
goto xa_unlocked;
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
2021-02-25 04:03:27 +08:00
|
|
|
nr = thp_nr_pages(new_page);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2019-09-24 06:38:00 +08:00
|
|
|
if (is_shmem)
|
2021-02-25 04:03:31 +08:00
|
|
|
__mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
|
2019-09-24 06:38:03 +08:00
|
|
|
else {
|
2021-02-25 04:03:27 +08:00
|
|
|
__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
|
2019-09-24 06:38:03 +08:00
|
|
|
filemap_nr_thps_inc(mapping);
|
mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE. This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY). Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open(). Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().
Systems that make heavy use of shared libraries (e.g. Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.
This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()). It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.
Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.
The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout. The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping
to be backed with THPs.
Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.
4KB Pages:
==========
count event_name # count / runtime
598,995,035,942 cpu-cycles # 1.800861 GHz
81,195,620,851 raw-stall-frontend # 244.112 M/sec
347,754,466,597 iTLB-loads # 1.046 G/sec
2,970,248,900 iTLB-load-misses # 0.854122% miss rate
Total test time: 332.854998 seconds.
2MB Pages:
==========
count event_name # count / runtime
592,872,663,047 cpu-cycles # 1.800358 GHz
76,485,624,143 raw-stall-frontend # 232.261 M/sec
350,478,413,710 iTLB-loads # 1.064 G/sec
803,233,322 iTLB-load-misses # 0.229182% miss rate
Total test time: 329.826087 seconds
A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
/apex/com.android.art/lib64/libart.so
FilePmdMapped: 4096 kB
/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped: 2048 kB
Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:51:32 +08:00
|
|
|
/*
|
|
|
|
* Paired with smp_mb() in do_dentry_open() to ensure
|
|
|
|
* i_writecount is up to date and the update to nr_thps is
|
|
|
|
* visible. Ensures the page cache will be truncated if the
|
|
|
|
* file is opened writable.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
if (inode_is_open_for_write(mapping->host)) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
|
|
|
|
filemap_nr_thps_dec(mapping);
|
|
|
|
goto xa_locked;
|
|
|
|
}
|
2019-09-24 06:38:03 +08:00
|
|
|
}
|
2019-09-24 06:38:00 +08:00
|
|
|
|
2018-12-01 06:10:39 +08:00
|
|
|
if (nr_none) {
|
2020-06-04 07:02:04 +08:00
|
|
|
__mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
|
2019-09-24 06:38:00 +08:00
|
|
|
if (is_shmem)
|
2020-06-04 07:02:04 +08:00
|
|
|
__mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
|
2018-12-01 06:10:39 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
xa_locked:
|
|
|
|
xas_unlock_irq(&xas);
|
2017-12-05 03:56:08 +08:00
|
|
|
xa_unlocked:
|
2018-12-01 06:10:39 +08:00
|
|
|
|
2016-07-27 06:26:32 +08:00
|
|
|
if (result == SCAN_SUCCEED) {
|
2017-12-05 03:56:08 +08:00
|
|
|
struct page *page, *tmp;
|
2016-07-27 06:26:32 +08:00
|
|
|
|
|
|
|
/*
|
2017-12-05 03:56:08 +08:00
|
|
|
* Replacing old pages with new one has succeeded, now we
|
|
|
|
* need to copy the content and free the old pages.
|
2016-07-27 06:26:32 +08:00
|
|
|
*/
|
2018-12-01 06:10:35 +08:00
|
|
|
index = start;
|
2016-07-27 06:26:32 +08:00
|
|
|
list_for_each_entry_safe(page, tmp, &pagelist, lru) {
|
2018-12-01 06:10:35 +08:00
|
|
|
while (index < page->index) {
|
|
|
|
clear_highpage(new_page + (index % HPAGE_PMD_NR));
|
|
|
|
index++;
|
|
|
|
}
|
2016-07-27 06:26:32 +08:00
|
|
|
copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
|
|
|
|
page);
|
|
|
|
list_del(&page->lru);
|
|
|
|
page->mapping = NULL;
|
2018-12-01 06:10:39 +08:00
|
|
|
page_ref_unfreeze(page, 1);
|
2016-07-27 06:26:32 +08:00
|
|
|
ClearPageActive(page);
|
|
|
|
ClearPageUnevictable(page);
|
2018-12-01 06:10:39 +08:00
|
|
|
unlock_page(page);
|
2016-07-27 06:26:32 +08:00
|
|
|
put_page(page);
|
2018-12-01 06:10:35 +08:00
|
|
|
index++;
|
|
|
|
}
|
|
|
|
while (index < end) {
|
|
|
|
clear_highpage(new_page + (index % HPAGE_PMD_NR));
|
|
|
|
index++;
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
SetPageUptodate(new_page);
|
2018-12-01 06:10:43 +08:00
|
|
|
page_ref_add(new_page, HPAGE_PMD_NR - 1);
|
2020-06-04 07:02:40 +08:00
|
|
|
if (is_shmem)
|
2019-09-24 06:38:00 +08:00
|
|
|
set_page_dirty(new_page);
|
2020-06-04 07:02:40 +08:00
|
|
|
lru_cache_add(new_page);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
2018-12-01 06:10:39 +08:00
|
|
|
/*
|
|
|
|
* Remove pte page tables, so we can re-fault the page as huge.
|
|
|
|
*/
|
|
|
|
retract_page_tables(mapping, start);
|
2016-07-27 06:26:32 +08:00
|
|
|
*hpage = NULL;
|
2018-08-18 06:45:29 +08:00
|
|
|
|
|
|
|
khugepaged_pages_collapsed++;
|
2016-07-27 06:26:32 +08:00
|
|
|
} else {
|
2017-12-05 03:56:08 +08:00
|
|
|
struct page *page;
|
2018-12-01 06:10:29 +08:00
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
/* Something went wrong: roll back page cache changes */
|
|
|
|
xas_lock_irq(&xas);
|
2018-12-01 06:10:29 +08:00
|
|
|
mapping->nrpages -= nr_none;
|
2019-09-24 06:38:00 +08:00
|
|
|
|
|
|
|
if (is_shmem)
|
|
|
|
shmem_uncharge(mapping->host, nr_none);
|
2018-12-01 06:10:29 +08:00
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_set(&xas, start);
|
|
|
|
xas_for_each(&xas, page, end - 1) {
|
2016-07-27 06:26:32 +08:00
|
|
|
page = list_first_entry_or_null(&pagelist,
|
|
|
|
struct page, lru);
|
2017-12-05 03:56:08 +08:00
|
|
|
if (!page || xas.xa_index < page->index) {
|
2016-07-27 06:26:32 +08:00
|
|
|
if (!nr_none)
|
|
|
|
break;
|
|
|
|
nr_none--;
|
2016-12-13 08:43:35 +08:00
|
|
|
/* Put holes back where they were */
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_store(&xas, NULL);
|
2016-07-27 06:26:32 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-12-05 03:56:08 +08:00
|
|
|
VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
|
|
|
/* Unfreeze the page. */
|
|
|
|
list_del(&page->lru);
|
|
|
|
page_ref_unfreeze(page, 2);
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_store(&xas, page);
|
|
|
|
xas_pause(&xas);
|
|
|
|
xas_unlock_irq(&xas);
|
2016-07-27 06:26:32 +08:00
|
|
|
unlock_page(page);
|
2018-12-01 06:10:39 +08:00
|
|
|
putback_lru_page(page);
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_lock_irq(&xas);
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
VM_BUG_ON(nr_none);
|
2017-12-05 03:56:08 +08:00
|
|
|
xas_unlock_irq(&xas);
|
2016-07-27 06:26:32 +08:00
|
|
|
|
|
|
|
new_page->mapping = NULL;
|
|
|
|
}
|
2018-12-01 06:10:39 +08:00
|
|
|
|
|
|
|
unlock_page(new_page);
|
2016-07-27 06:26:32 +08:00
|
|
|
out:
|
|
|
|
VM_BUG_ON(!list_empty(&pagelist));
|
2020-06-04 07:02:04 +08:00
|
|
|
if (!IS_ERR_OR_NULL(*hpage))
|
|
|
|
mem_cgroup_uncharge(*hpage);
|
2016-07-27 06:26:32 +08:00
|
|
|
/* TODO: tracepoints */
|
|
|
|
}
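collapse_file() above freezes each old page with page_ref_freeze(page, 3), then unfreezes it to 1 on success or to 2 on rollback. The following is a minimal userspace sketch of that freeze/unfreeze idea, assuming the count is a plain C11 atomic and that freezing is a compare-and-swap from the expected count to 0. The names fake_page, fake_ref_freeze, fake_ref_unfreeze and fake_get_unless_zero are hypothetical stand-ins for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct fake_page {
	atomic_int refcount;
};

/*
 * Freeze succeeds only if the count is exactly `expected`; the count is
 * then atomically replaced by 0. Any unexpected extra reference makes
 * the freeze fail, which is how the collapse detects other users.
 */
static bool fake_ref_freeze(struct fake_page *p, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/* Unfreeze restores a non-zero count chosen by the caller. */
static void fake_ref_unfreeze(struct fake_page *p, int count)
{
	assert(count > 0);
	atomic_store(&p->refcount, count);
}

/*
 * Speculative getter: takes a reference unless the count is 0.
 * A frozen page therefore cannot be acquired by a lockless lookup.
 */
static bool fake_get_unless_zero(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure `old` is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In this sketch a getter that finds the count at 0 simply gives up. The commit message quoted earlier describes speculative getters in the kernel as spinning on a frozen page, which is why keeping a page frozen for a long time is costly.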
|
|
|
|
|
2019-09-24 06:37:57 +08:00
|
|
|
static void khugepaged_scan_file(struct mm_struct *mm,
|
|
|
|
struct file *file, pgoff_t start, struct page **hpage)
|
2016-07-27 06:26:32 +08:00
|
|
|
{
|
|
|
|
struct page *page = NULL;
|
2019-09-24 06:37:57 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2017-12-05 04:06:23 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start);
|
2016-07-27 06:26:32 +08:00
|
|
|
int present, swap;
|
|
|
|
int node = NUMA_NO_NODE;
|
|
|
|
int result = SCAN_SUCCEED;
|
|
|
|
|
|
|
|
present = 0;
|
|
|
|
swap = 0;
|
|
|
|
memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
|
|
|
|
rcu_read_lock();
|
2017-12-05 04:06:23 +08:00
|
|
|
xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
|
|
|
|
if (xas_retry(&xas, page))
|
2016-07-27 06:26:32 +08:00
|
|
|
continue;
|
|
|
|
|
2017-12-05 04:06:23 +08:00
|
|
|
if (xa_is_value(page)) {
|
2016-07-27 06:26:32 +08:00
|
|
|
if (++swap > khugepaged_max_ptes_swap) {
|
|
|
|
result = SCAN_EXCEED_SWAP_PTE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (PageTransCompound(page)) {
|
|
|
|
result = SCAN_PAGE_COMPOUND;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
node = page_to_nid(page);
|
|
|
|
if (khugepaged_scan_abort(node)) {
|
|
|
|
result = SCAN_SCAN_ABORT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
khugepaged_node_load[node]++;
|
|
|
|
|
|
|
|
if (!PageLRU(page)) {
|
|
|
|
result = SCAN_PAGE_LRU;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2019-09-24 06:38:00 +08:00
|
|
|
if (page_count(page) !=
|
|
|
|
1 + page_mapcount(page) + page_has_private(page)) {
|
2016-07-27 06:26:32 +08:00
|
|
|
result = SCAN_PAGE_COUNT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We probably should check if the page is referenced here, but
|
|
|
|
* nobody would transfer pte_young() to PageReferenced() for us.
|
|
|
|
* And rmap walk here is just too costly...
|
|
|
|
*/
|
|
|
|
|
|
|
|
present++;
|
|
|
|
|
|
|
|
if (need_resched()) {
|
2017-12-05 04:06:23 +08:00
|
|
|
xas_pause(&xas);
|
2016-07-27 06:26:32 +08:00
|
|
|
cond_resched_rcu();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
if (result == SCAN_SUCCEED) {
|
|
|
|
if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
|
|
|
|
result = SCAN_EXCEED_NONE_PTE;
|
|
|
|
} else {
|
|
|
|
node = khugepaged_find_target_node();
|
2019-09-24 06:37:57 +08:00
|
|
|
collapse_file(mm, file, start, hpage, node);
|
2016-07-27 06:26:32 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* TODO: tracepoints */
|
|
|
|
}
|
|
|
|
#else
|
2019-09-24 06:37:57 +08:00
|
|
|
static void khugepaged_scan_file(struct mm_struct *mm,
|
|
|
|
struct file *file, pgoff_t start, struct page **hpage)
|
2016-07-27 06:26:32 +08:00
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
}
|
2019-09-24 06:38:30 +08:00
|
|
|
|
2021-05-05 09:33:37 +08:00
|
|
|
static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
|
2019-09-24 06:38:30 +08:00
|
|
|
{
|
|
|
|
}
|
2016-07-27 06:26:32 +08:00
|
|
|
#endif
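The final decision in khugepaged_scan_file() above reduces to a single comparison: the range is handed to collapse_file() only when enough of its HPAGE_PMD_NR page cache slots are populated. The following is a minimal userspace sketch of that check, not kernel code. It assumes HPAGE_PMD_NR is 512 (4 KiB pages, 2 MiB PMD); scan_file_verdict() and the SKETCH_ names are hypothetical.

```c
#include <assert.h>

/* Assumed value: 512 base pages per PMD-sized huge page (4 KiB pages, 2 MiB PMD). */
#define SKETCH_HPAGE_PMD_NR 512u

/* Hypothetical stand-ins for the two scan results involved in the final check. */
enum sketch_scan_result {
	SKETCH_SCAN_SUCCEED,
	SKETCH_SCAN_EXCEED_NONE_PTE,
};

/*
 * Mirrors "present < HPAGE_PMD_NR - khugepaged_max_ptes_none":
 *   present       - populated page cache slots found in the range
 *   max_ptes_none - how many empty slots the tunable tolerates
 */
enum sketch_scan_result scan_file_verdict(unsigned int present,
					  unsigned int max_ptes_none)
{
	assert(present <= SKETCH_HPAGE_PMD_NR);
	assert(max_ptes_none <= SKETCH_HPAGE_PMD_NR);

	if (present < SKETCH_HPAGE_PMD_NR - max_ptes_none)
		return SKETCH_SCAN_EXCEED_NONE_PTE;
	return SKETCH_SCAN_SUCCEED;
}
```

With max_ptes_none at 511 a single present page is enough to pass the check; with max_ptes_none at 0 all 512 slots must be populated.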
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
|
|
|
|
struct page **hpage)
|
|
|
|
__releases(&khugepaged_mm_lock)
|
|
|
|
__acquires(&khugepaged_mm_lock)
|
|
|
|
{
|
|
|
|
struct mm_slot *mm_slot;
|
|
|
|
struct mm_struct *mm;
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
int progress = 0;
|
|
|
|
|
|
|
|
VM_BUG_ON(!pages);
|
2018-10-05 14:45:47 +08:00
|
|
|
lockdep_assert_held(&khugepaged_mm_lock);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
if (khugepaged_scan.mm_slot)
|
|
|
|
mm_slot = khugepaged_scan.mm_slot;
|
|
|
|
else {
|
|
|
|
mm_slot = list_entry(khugepaged_scan.mm_head.next,
|
|
|
|
struct mm_slot, mm_node);
|
|
|
|
khugepaged_scan.address = 0;
|
|
|
|
khugepaged_scan.mm_slot = mm_slot;
|
|
|
|
}
|
|
|
|
spin_unlock(&khugepaged_mm_lock);
|
2019-09-24 06:38:30 +08:00
|
|
|
khugepaged_collapse_pte_mapped_thps(mm_slot);
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
mm = mm_slot->mm;
|
2018-02-01 08:18:28 +08:00
|
|
|
/*
|
|
|
|
* Don't wait for semaphore (to avoid long wait times). Just move to
|
|
|
|
* the next mm on the list.
|
|
|
|
*/
|
|
|
|
vma = NULL;
|
2020-06-09 12:33:25 +08:00
|
|
|
if (unlikely(!mmap_read_trylock(mm)))
|
2020-06-09 12:33:54 +08:00
|
|
|
goto breakouterloop_mmap_lock;
|
2018-02-01 08:18:28 +08:00
|
|
|
if (likely(!khugepaged_test_exit(mm)))
|
2016-07-27 06:26:24 +08:00
|
|
|
vma = find_vma(mm, khugepaged_scan.address);
|
|
|
|
|
|
|
|
progress++;
|
|
|
|
for (; vma; vma = vma->vm_next) {
|
|
|
|
unsigned long hstart, hend;
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
if (unlikely(khugepaged_test_exit(mm))) {
|
|
|
|
progress++;
|
|
|
|
break;
|
|
|
|
}
|
mm: thp: pass correct vm_flags to hugepage_vma_check()
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
hugepage_vma_check(). The argument vm_flags contains the latest value.
Therefore, it is necessary to pass this vm_flags into
hugepage_vma_check().
With this bug, madvise(MADV_HUGEPAGE) on mmap()ed shmem files fails to
back the memory with huge pages. Here is an example of a failing madvise():
    /* mount /dev/shm with huge=advise:
     *     mount -o remount,huge=advise /dev/shm */
    /* create file /dev/shm/huge */
    #define HUGE_FILE "/dev/shm/huge"
    fd = open(HUGE_FILE, O_RDONLY);
    ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);
madvise() returns 0, but this memory region is never backed by huge
pages (check ShmemHugePages in /proc/meminfo).
Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:47:00 +08:00
|
|
|
if (!hugepage_vma_check(vma, vma->vm_flags)) {
|
2016-07-27 06:26:24 +08:00
|
|
|
skip:
|
|
|
|
progress++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
|
|
|
|
hend = vma->vm_end & HPAGE_PMD_MASK;
|
|
|
|
if (hstart >= hend)
|
|
|
|
goto skip;
|
|
|
|
if (khugepaged_scan.address > hend)
|
|
|
|
goto skip;
|
|
|
|
if (khugepaged_scan.address < hstart)
|
|
|
|
khugepaged_scan.address = hstart;
|
|
|
|
VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
|
2020-04-07 11:04:35 +08:00
|
|
|
if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
|
|
|
|
goto skip;
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
while (khugepaged_scan.address < hend) {
|
|
|
|
int ret;
|
|
|
|
cond_resched();
|
|
|
|
if (unlikely(khugepaged_test_exit(mm)))
|
|
|
|
goto breakouterloop;
|
|
|
|
|
|
|
|
VM_BUG_ON(khugepaged_scan.address < hstart ||
|
|
|
|
khugepaged_scan.address + HPAGE_PMD_SIZE >
|
|
|
|
hend);
|
2019-09-24 06:38:00 +08:00
|
|
|
if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
|
2020-04-07 11:04:35 +08:00
|
|
|
struct file *file = get_file(vma->vm_file);
|
2016-07-27 06:26:32 +08:00
|
|
|
pgoff_t pgoff = linear_page_index(vma,
|
|
|
|
khugepaged_scan.address);
|
2019-09-24 06:38:00 +08:00
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2016-07-27 06:26:32 +08:00
|
|
|
ret = 1;
|
2019-09-24 06:37:57 +08:00
|
|
|
khugepaged_scan_file(mm, file, pgoff, hpage);
|
2016-07-27 06:26:32 +08:00
|
|
|
fput(file);
|
|
|
|
} else {
|
|
|
|
ret = khugepaged_scan_pmd(mm, vma,
|
|
|
|
khugepaged_scan.address,
|
|
|
|
hpage);
|
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
/* move to next address */
|
|
|
|
khugepaged_scan.address += HPAGE_PMD_SIZE;
|
|
|
|
progress += HPAGE_PMD_NR;
|
|
|
|
if (ret)
|
2020-06-09 12:33:54 +08:00
|
|
|
/* we released mmap_lock so break loop */
|
|
|
|
goto breakouterloop_mmap_lock;
|
2016-07-27 06:26:24 +08:00
|
|
|
if (progress >= pages)
|
|
|
|
goto breakouterloop;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
breakouterloop:
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
|
2020-06-09 12:33:54 +08:00
|
|
|
breakouterloop_mmap_lock:
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
spin_lock(&khugepaged_mm_lock);
|
|
|
|
VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
|
|
|
|
/*
|
|
|
|
* Release the current mm_slot if this mm is about to die, or
|
|
|
|
* if we scanned all vmas of this mm.
|
|
|
|
*/
|
|
|
|
if (khugepaged_test_exit(mm) || !vma) {
|
|
|
|
/*
|
|
|
|
* Make sure that if mm_users is reaching zero while
|
|
|
|
* khugepaged runs here, khugepaged_exit will find
|
|
|
|
* mm_slot not pointing to the exiting mm.
|
|
|
|
*/
|
|
|
|
if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
|
|
|
|
khugepaged_scan.mm_slot = list_entry(
|
|
|
|
mm_slot->mm_node.next,
|
|
|
|
struct mm_slot, mm_node);
|
|
|
|
khugepaged_scan.address = 0;
|
|
|
|
} else {
|
|
|
|
khugepaged_scan.mm_slot = NULL;
|
|
|
|
khugepaged_full_scans++;
|
|
|
|
}
|
|
|
|
|
|
|
|
collect_mm_slot(mm_slot);
|
|
|
|
}
|
|
|
|
|
|
|
|
return progress;
|
|
|
|
}
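The hstart/hend computation in khugepaged_scan_mm_slot() above rounds the VMA start up and the VMA end down to huge-page boundaries, and skips the VMA when no aligned huge page fits in between. The following is a small userspace sketch of that arithmetic, assuming 2 MiB huge pages. The SKETCH_ constants, struct huge_range and vma_huge_range() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry: 2 MiB PMD-sized huge pages. */
#define SKETCH_HPAGE_PMD_SIZE (1UL << 21)
#define SKETCH_HPAGE_PMD_MASK (~(SKETCH_HPAGE_PMD_SIZE - 1))

struct huge_range {
	unsigned long hstart;	/* first huge-page-aligned address in the VMA */
	unsigned long hend;	/* end of the last fully contained huge page */
};

/*
 * Computes the scannable sub-range of [vm_start, vm_end).
 * Returns false when the VMA holds no complete aligned huge page,
 * which corresponds to the "hstart >= hend" skip in the kernel loop.
 */
bool vma_huge_range(unsigned long vm_start, unsigned long vm_end,
		    struct huge_range *out)
{
	assert(vm_start <= vm_end);

	/* Round the start up and the end down to a huge-page boundary. */
	out->hstart = (vm_start + ~SKETCH_HPAGE_PMD_MASK) & SKETCH_HPAGE_PMD_MASK;
	out->hend = vm_end & SKETCH_HPAGE_PMD_MASK;

	return out->hstart < out->hend;
}
```

For a VMA spanning 0x201000 to 0x800000 this yields the range 0x400000 to 0x800000, while a VMA from 0x201000 to 0x3ff000 contains no aligned huge page and is skipped.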
|
|
|
|
|
|
|
|
static int khugepaged_has_work(void)
|
|
|
|
{
|
|
|
|
return !list_empty(&khugepaged_scan.mm_head) &&
|
|
|
|
khugepaged_enabled();
|
|
|
|
}
|
|
|
|
|
|
|
|
static int khugepaged_wait_event(void)
|
|
|
|
{
|
|
|
|
return !list_empty(&khugepaged_scan.mm_head) ||
|
|
|
|
kthread_should_stop();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void khugepaged_do_scan(void)
|
|
|
|
{
|
|
|
|
struct page *hpage = NULL;
|
|
|
|
unsigned int progress = 0, pass_through_head = 0;
|
2021-05-05 09:34:12 +08:00
|
|
|
unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
|
2016-07-27 06:26:24 +08:00
|
|
|
bool wait = true;
|
|
|
|
|
2020-06-04 07:00:12 +08:00
|
|
|
lru_add_drain_all();
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
while (progress < pages) {
|
|
|
|
if (!khugepaged_prealloc_page(&hpage, &wait))
|
|
|
|
break;
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
|
|
|
|
if (unlikely(kthread_should_stop() || try_to_freeze()))
|
|
|
|
break;
|
|
|
|
|
|
|
|
spin_lock(&khugepaged_mm_lock);
|
|
|
|
if (!khugepaged_scan.mm_slot)
|
|
|
|
pass_through_head++;
|
|
|
|
if (khugepaged_has_work() &&
|
|
|
|
pass_through_head < 2)
|
|
|
|
progress += khugepaged_scan_mm_slot(pages - progress,
|
|
|
|
&hpage);
|
|
|
|
else
|
|
|
|
progress = pages;
|
|
|
|
spin_unlock(&khugepaged_mm_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!IS_ERR_OR_NULL(hpage))
|
|
|
|
put_page(hpage);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool khugepaged_should_wakeup(void)
|
|
|
|
{
|
|
|
|
return kthread_should_stop() ||
|
|
|
|
time_after_eq(jiffies, khugepaged_sleep_expire);
|
|
|
|
}
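khugepaged_should_wakeup() above uses time_after_eq() so that the deadline comparison stays correct when jiffies wraps around. The idea can be sketched in userspace as a signed difference. The function names below are hypothetical, and the cast assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True when tick count a is at or after b, even if the counter wrapped
 * in between. Only valid while the two values are less than half the
 * counter range apart.
 */
bool sketch_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Deadline check in the style of khugepaged_should_wakeup(). */
bool sketch_sleep_expired(unsigned long now, unsigned long start,
			  unsigned long timeout)
{
	/* The signed-difference trick needs spans below half the range. */
	assert(timeout <= (unsigned long)LONG_MAX);

	/* start + timeout may wrap; the comparison still works. */
	return sketch_time_after_eq(now, start + timeout);
}
```

A plain "now >= expire" comparison would report a deadline as already expired, or never expired, whenever the counter wraps between the two values; the signed difference avoids that.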
|
|
|
|
|
|
|
|
static void khugepaged_wait_work(void)
|
|
|
|
{
|
|
|
|
if (khugepaged_has_work()) {
|
|
|
|
const unsigned long scan_sleep_jiffies =
|
|
|
|
msecs_to_jiffies(khugepaged_scan_sleep_millisecs);
|
|
|
|
|
|
|
|
if (!scan_sleep_jiffies)
|
|
|
|
return;
|
|
|
|
|
|
|
|
khugepaged_sleep_expire = jiffies + scan_sleep_jiffies;
|
|
|
|
wait_event_freezable_timeout(khugepaged_wait,
|
|
|
|
khugepaged_should_wakeup(),
|
|
|
|
scan_sleep_jiffies);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (khugepaged_enabled())
|
|
|
|
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
|
|
|
|
}
|
|
|
|
|
|
|
|
static int khugepaged(void *none)
|
|
|
|
{
|
|
|
|
struct mm_slot *mm_slot;
|
|
|
|
|
|
|
|
set_freezable();
|
|
|
|
set_user_nice(current, MAX_NICE);
|
|
|
|
|
|
|
|
while (!kthread_should_stop()) {
|
|
|
|
khugepaged_do_scan();
|
|
|
|
khugepaged_wait_work();
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&khugepaged_mm_lock);
|
|
|
|
mm_slot = khugepaged_scan.mm_slot;
|
|
|
|
khugepaged_scan.mm_slot = NULL;
|
|
|
|
if (mm_slot)
|
|
|
|
collect_mm_slot(mm_slot);
|
|
|
|
spin_unlock(&khugepaged_mm_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void set_recommended_min_free_kbytes(void)
|
|
|
|
{
|
|
|
|
struct zone *zone;
|
|
|
|
int nr_zones = 0;
|
|
|
|
unsigned long recommended_min;
|
|
|
|
|
mm/thp: don't count ZONE_MOVABLE as the target for freepage reserving
There was a regression report for "mm/cma: manage the memory of the CMA
area by using the ZONE_MOVABLE" [1] and I think that it is related to
this problem. CMA patchset makes the system use one more zone
(ZONE_MOVABLE) and then increases min_free_kbytes. It reduces usable
memory and it could cause regression.
ZONE_MOVABLE only has movable pages so we don't need to keep enough
freepages to avoid or deal with fragmentation. So, don't count it.
This lowers min_free_kbytes, and thus min_watermark, considerably when
ZONE_MOVABLE is used, leaving more memory available to the user.
System: 22GB ram, fakenuma, 2 nodes. 5 zones are used.
min_free_kbytes: 112640 (before), 90112 (after)
zone_info (min_watermark):
Node  Zone     min (before)  min (after)
0     DMA                19           15
0     DMA32            3778         3022
0     Normal          10191         8152
0     Movable             0            0
0     Device              0            0
1     DMA                 0            0
1     DMA32               0            0
1     Normal          14043        11234
1     Movable           127          102
1     Device              0            0
[1] lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:30:27 +08:00
|
|
|
for_each_populated_zone(zone) {
|
|
|
|
/*
|
|
|
|
* We don't need to worry about fragmentation of
|
|
|
|
* ZONE_MOVABLE since it only has movable pages.
|
|
|
|
*/
|
|
|
|
if (zone_idx(zone) > gfp_zone(GFP_USER))
|
|
|
|
continue;
|
|
|
|
|
2016-07-27 06:26:24 +08:00
|
|
|
nr_zones++;
|
2018-04-11 07:30:27 +08:00
|
|
|
}
|
2016-07-27 06:26:24 +08:00
|
|
|
|
|
|
|
/* Ensure 2 pageblocks are free to assist fragmentation avoidance */
|
|
|
|
recommended_min = pageblock_nr_pages * nr_zones * 2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure that on average at least two pageblocks are almost free
|
|
|
|
* of another type, one for a migratetype to fall back to and a
|
|
|
|
* second to avoid subsequent fallbacks of other types. There are 3
|
|
|
|
* MIGRATE_TYPES we care about.
|
|
|
|
*/
|
|
|
|
recommended_min += pageblock_nr_pages * nr_zones *
|
|
|
|
MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
|
|
|
|
|
|
|
|
/* never allow reserving more than 5% of lowmem */
|
|
|
|
recommended_min = min(recommended_min,
|
|
|
|
(unsigned long) nr_free_buffer_pages() / 20);
|
|
|
|
recommended_min <<= (PAGE_SHIFT-10);
|
|
|
|
|
|
|
|
if (recommended_min > min_free_kbytes) {
|
|
|
|
if (user_min_free_kbytes >= 0)
|
|
|
|
pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
|
|
|
|
min_free_kbytes, recommended_min);
|
|
|
|
|
|
|
|
min_free_kbytes = recommended_min;
|
|
|
|
}
|
|
|
|
setup_per_zone_wmarks();
|
|
|
|
}
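The arithmetic in set_recommended_min_free_kbytes() above can be checked against the figures in the quoted commit message. The sketch below is a userspace restatement of the formula, not kernel code. It assumes pageblock_nr_pages = 512 and MIGRATE_PCPTYPES = 3 (typical for x86-64 with 4 KiB pages); the function name, its parameters and the SKETCH_ constants are hypothetical.

```c
#include <assert.h>

/* Assumed values, typical for x86-64 with 4 KiB pages; not read from a real kernel. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL
#define SKETCH_MIGRATE_PCPTYPES   3UL

/*
 * nr_zones          - populated zones that are counted (ZONE_MOVABLE excluded)
 * free_buffer_pages - stand-in for nr_free_buffer_pages()
 * page_shift        - stand-in for PAGE_SHIFT (12 for 4 KiB pages)
 * Returns the recommended minimum in kilobytes.
 */
unsigned long sketch_recommended_min_kbytes(unsigned long nr_zones,
					    unsigned long free_buffer_pages,
					    unsigned int page_shift)
{
	unsigned long pages;
	unsigned long cap = free_buffer_pages / 20;	/* 5% of lowmem */

	assert(page_shift >= 10);

	/* Two free pageblocks per zone to assist fragmentation avoidance. */
	pages = SKETCH_PAGEBLOCK_NR_PAGES * nr_zones * 2;

	/* Plus room for fallbacks between the migrate types. */
	pages += SKETCH_PAGEBLOCK_NR_PAGES * nr_zones *
		 SKETCH_MIGRATE_PCPTYPES * SKETCH_MIGRATE_PCPTYPES;

	if (pages > cap)
		pages = cap;

	/* Convert pages to kilobytes. */
	return pages << (page_shift - 10);
}
```

Under those assumptions, five counted zones give 112640 kB and four give 90112 kB, which matches the before and after values of min_free_kbytes reported in the commit message: dropping ZONE_MOVABLE from the count removes one zone's worth of reserve.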
|
|
|
|
|
|
|
|
int start_stop_khugepaged(void)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
mutex_lock(&khugepaged_mutex);
|
|
|
|
if (khugepaged_enabled()) {
|
|
|
|
if (!khugepaged_thread)
|
|
|
|
khugepaged_thread = kthread_run(khugepaged, NULL,
|
|
|
|
"khugepaged");
|
|
|
|
if (IS_ERR(khugepaged_thread)) {
|
|
|
|
pr_err("khugepaged: kthread_run(khugepaged) failed\n");
|
|
|
|
err = PTR_ERR(khugepaged_thread);
|
|
|
|
khugepaged_thread = NULL;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!list_empty(&khugepaged_scan.mm_head))
|
|
|
|
wake_up_interruptible(&khugepaged_wait);
|
|
|
|
|
|
|
|
set_recommended_min_free_kbytes();
|
|
|
|
} else if (khugepaged_thread) {
|
|
|
|
kthread_stop(khugepaged_thread);
|
|
|
|
khugepaged_thread = NULL;
|
|
|
|
}
|
|
|
|
fail:
|
|
|
|
mutex_unlock(&khugepaged_mutex);
|
|
|
|
return err;
|
|
|
|
}
|
2020-10-11 14:16:40 +08:00
|
|
|
|
|
|
|
void khugepaged_min_free_kbytes_update(void)
|
|
|
|
{
|
|
|
|
mutex_lock(&khugepaged_mutex);
|
|
|
|
if (khugepaged_enabled() && khugepaged_thread)
|
|
|
|
set_recommended_min_free_kbytes();
|
|
|
|
mutex_unlock(&khugepaged_mutex);
|
|
|
|
}
|