2019-05-19 20:08:55 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* linux/mm/filemap.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1994-1999 Linus Torvalds
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This file handles the generic file mmap semantics used by
|
|
|
|
* most "normal" filesystems (but you don't /have/ to use this:
|
|
|
|
* the NFS filesystem used to do this differently, for example)
|
|
|
|
*/
|
2011-10-16 14:01:52 +08:00
|
|
|
#include <linux/export.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/compiler.h>
|
2016-01-23 07:10:40 +08:00
|
|
|
#include <linux/dax.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/fs.h>
|
2017-02-09 01:51:30 +08:00
|
|
|
#include <linux/sched/signal.h>
|
2006-06-23 17:04:16 +08:00
|
|
|
#include <linux/uaccess.h>
|
2006-01-12 04:17:46 +08:00
|
|
|
#include <linux/capability.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/kernel_stat.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/gfp.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/mman.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/uio.h>
|
2019-05-14 08:21:04 +08:00
|
|
|
#include <linux/error-injection.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/hash.h>
|
|
|
|
#include <linux/writeback.h>
|
2007-10-19 05:47:32 +08:00
|
|
|
#include <linux/backing-dev.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/pagevec.h>
|
|
|
|
#include <linux/blkdev.h>
|
|
|
|
#include <linux/security.h>
|
2006-03-24 19:16:04 +08:00
|
|
|
#include <linux/cpuset.h>
|
mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:20 +08:00
|
|
|
#include <linux/hugetlb.h>
|
2008-02-07 16:13:53 +08:00
|
|
|
#include <linux/memcontrol.h>
|
2011-05-27 00:01:43 +08:00
|
|
|
#include <linux/cleancache.h>
|
2017-11-16 09:37:41 +08:00
|
|
|
#include <linux/shmem_fs.h>
|
2014-04-08 06:37:19 +08:00
|
|
|
#include <linux/rmap.h>
|
2018-10-27 06:06:08 +08:00
|
|
|
#include <linux/delayacct.h>
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
#include <linux/psi.h>
|
2019-10-19 11:20:20 +08:00
|
|
|
#include <linux/ramfs.h>
|
2006-03-22 16:08:33 +08:00
|
|
|
#include "internal.h"
|
|
|
|
|
2013-04-30 06:06:10 +08:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/filemap.h>
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* FIXME: remove all knowledge of the buffer layer from the core VM
|
|
|
|
*/
|
2009-08-18 01:52:36 +08:00
|
|
|
#include <linux/buffer_head.h> /* for try_to_free_buffers */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#include <asm/mman.h>
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Shared mappings implemented 30.11.1994. It's not fully working yet,
|
|
|
|
* though.
|
|
|
|
*
|
|
|
|
* Shared mappings now work. 15.8.1995 Bruno.
|
|
|
|
*
|
|
|
|
* finished 'unifying' the page and buffer cache and SMP-threaded the
|
|
|
|
* page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
|
|
|
|
*
|
|
|
|
* SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lock ordering:
|
|
|
|
*
|
2014-12-13 08:54:24 +08:00
|
|
|
* ->i_mmap_rwsem (truncate_pagecache)
|
2005-04-17 06:20:36 +08:00
|
|
|
* ->private_lock (__free_pte->__set_page_dirty_buffers)
|
[PATCH] swap: swap_lock replace list+device
The idea of a swap_device_lock per device, and a swap_list_lock over them all,
is appealing; but in practice almost every holder of swap_device_lock must
already hold swap_list_lock, which defeats the purpose of the split.
The only exceptions have been swap_duplicate, valid_swaphandles and an
untrodden path in try_to_unuse (plus a few places added in this series).
valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
demand attention. However, with the hold time in get_swap_pages so much
reduced, I've not yet found a load and set of swap device priorities to show
even swap_duplicate benefitting from the split. Certainly the split is mere
overhead in the common case of a single swap device.
So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
(generally we seem to prefer an _ in the name, and not hide in a macro).
If someone can show a regression in swap_duplicate, then probably we should
add a hashlock for the swap_map entries alone (shorts being anatomic), so as
to help the case of the single swap device too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-04 06:54:41 +08:00
|
|
|
* ->swap_lock (exclusive_swap_page, others)
|
2018-04-11 07:36:56 +08:00
|
|
|
* ->i_pages lock
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2006-01-10 07:59:24 +08:00
|
|
|
* ->i_mutex
|
2014-12-13 08:54:24 +08:00
|
|
|
* ->i_mmap_rwsem (truncate->unmap_mapping_range)
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* ->mmap_sem
|
2014-12-13 08:54:24 +08:00
|
|
|
* ->i_mmap_rwsem
|
2005-10-30 09:16:41 +08:00
|
|
|
* ->page_table_lock or pte_lock (various, mainly in memory.c)
|
2018-04-11 07:36:56 +08:00
|
|
|
* ->i_pages lock (arch-dependent flush_dcache_mmap_lock)
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* ->mmap_sem
|
|
|
|
* ->lock_page (access_process_vm)
|
|
|
|
*
|
2014-02-12 11:36:48 +08:00
|
|
|
* ->i_mutex (generic_perform_write)
|
2006-10-20 14:29:10 +08:00
|
|
|
* ->mmap_sem (fault_in_pages_readable->do_page_fault)
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2011-04-22 08:19:44 +08:00
|
|
|
* bdi->wb.list_lock
|
2011-03-22 19:23:41 +08:00
|
|
|
* sb_lock (fs/fs-writeback.c)
|
2018-04-11 07:36:56 +08:00
|
|
|
* ->i_pages lock (__sync_single_inode)
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2014-12-13 08:54:24 +08:00
|
|
|
* ->i_mmap_rwsem
|
2005-04-17 06:20:36 +08:00
|
|
|
* ->anon_vma.lock (vma_adjust)
|
|
|
|
*
|
|
|
|
* ->anon_vma.lock
|
2005-10-30 09:16:41 +08:00
|
|
|
* ->page_table_lock or pte_lock (anon_vma_prepare and various)
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2005-10-30 09:16:41 +08:00
|
|
|
* ->page_table_lock or pte_lock
|
[PATCH] swap: swap_lock replace list+device
The idea of a swap_device_lock per device, and a swap_list_lock over them all,
is appealing; but in practice almost every holder of swap_device_lock must
already hold swap_list_lock, which defeats the purpose of the split.
The only exceptions have been swap_duplicate, valid_swaphandles and an
untrodden path in try_to_unuse (plus a few places added in this series).
valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
demand attention. However, with the hold time in get_swap_pages so much
reduced, I've not yet found a load and set of swap device priorities to show
even swap_duplicate benefitting from the split. Certainly the split is mere
overhead in the common case of a single swap device.
So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
(generally we seem to prefer an _ in the name, and not hide in a macro).
If someone can show a regression in swap_duplicate, then probably we should
add a hashlock for the swap_map entries alone (shorts being anatomic), so as
to help the case of the single swap device too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-04 06:54:41 +08:00
|
|
|
* ->swap_lock (try_to_unmap_one)
|
2005-04-17 06:20:36 +08:00
|
|
|
* ->private_lock (try_to_unmap_one)
|
2018-04-11 07:36:56 +08:00
|
|
|
* ->i_pages lock (try_to_unmap_one)
|
2019-03-06 07:49:39 +08:00
|
|
|
* ->pgdat->lru_lock (follow_page->mark_page_accessed)
|
|
|
|
* ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
|
2005-04-17 06:20:36 +08:00
|
|
|
* ->private_lock (page_remove_rmap->set_page_dirty)
|
2018-04-11 07:36:56 +08:00
|
|
|
* ->i_pages lock (page_remove_rmap->set_page_dirty)
|
2011-04-22 08:19:44 +08:00
|
|
|
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
|
2011-03-22 19:23:36 +08:00
|
|
|
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
|
2016-03-16 05:57:04 +08:00
|
|
|
* ->memcg->move_lock (page_remove_rmap->lock_page_memcg)
|
2011-04-22 08:19:44 +08:00
|
|
|
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
|
2011-03-22 19:23:36 +08:00
|
|
|
* ->inode->i_lock (zap_pte_range->set_page_dirty)
|
2005-04-17 06:20:36 +08:00
|
|
|
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
|
|
|
|
*
|
2014-12-13 08:54:24 +08:00
|
|
|
* ->i_mmap_rwsem
|
2012-03-22 07:34:09 +08:00
|
|
|
* ->tasklist_lock (memory_failure, collect_procs_ao)
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2017-11-21 22:17:59 +08:00
|
|
|
static void page_cache_delete(struct address_space *mapping,
|
2014-04-04 05:47:49 +08:00
|
|
|
struct page *page, void *shadow)
|
|
|
|
{
|
2017-11-21 22:17:59 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, page->index);
|
|
|
|
unsigned int nr = 1;
|
2016-12-13 08:43:17 +08:00
|
|
|
|
2017-11-21 22:17:59 +08:00
|
|
|
mapping_set_update(&xas, mapping);
|
2016-12-13 08:43:17 +08:00
|
|
|
|
2017-11-21 22:17:59 +08:00
|
|
|
/* hugetlb pages are represented by a single entry in the xarray */
|
|
|
|
if (!PageHuge(page)) {
|
|
|
|
xas_set_order(&xas, page->index, compound_order(page));
|
2019-09-24 06:34:30 +08:00
|
|
|
nr = compound_nr(page);
|
2017-11-21 22:17:59 +08:00
|
|
|
}
|
2014-04-04 05:47:49 +08:00
|
|
|
|
2016-07-27 06:26:04 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(PageTail(page), page);
|
|
|
|
VM_BUG_ON_PAGE(nr != 1 && shadow, page);
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
|
|
|
|
2017-11-21 22:17:59 +08:00
|
|
|
xas_store(&xas, shadow);
|
|
|
|
xas_init_marks(&xas);
|
mm: filemap: don't plant shadow entries without radix tree node
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.
Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.
Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-05 04:02:08 +08:00
|
|
|
|
2017-11-16 09:37:26 +08:00
|
|
|
page->mapping = NULL;
|
|
|
|
/* Leave page->index set: truncation lookup relies upon it */
|
|
|
|
|
mm: filemap: don't plant shadow entries without radix tree node
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.
Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.
Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-05 04:02:08 +08:00
|
|
|
if (shadow) {
|
|
|
|
mapping->nrexceptional += nr;
|
|
|
|
/*
|
|
|
|
* Make sure the nrexceptional update is committed before
|
|
|
|
* the nrpages update so that final truncate racing
|
|
|
|
* with reclaim does not see both counters 0 at the
|
|
|
|
* same time and miss a shadow entry.
|
|
|
|
*/
|
|
|
|
smp_wmb();
|
|
|
|
}
|
|
|
|
mapping->nrpages -= nr;
|
2014-04-04 05:47:49 +08:00
|
|
|
}
|
|
|
|
|
2017-11-16 09:37:29 +08:00
|
|
|
static void unaccount_page_cache_page(struct address_space *mapping,
|
|
|
|
struct page *page)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-11-16 09:37:29 +08:00
|
|
|
int nr;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-05-27 00:01:43 +08:00
|
|
|
/*
|
|
|
|
* if we're uptodate, flush out into the cleancache, otherwise
|
|
|
|
* invalidate any existing cleancache entries. We can't leave
|
|
|
|
* stale data around in the cleancache once our page is gone
|
|
|
|
*/
|
|
|
|
if (PageUptodate(page) && PageMappedToDisk(page))
|
|
|
|
cleancache_put_page(page);
|
|
|
|
else
|
2011-09-21 23:56:28 +08:00
|
|
|
cleancache_invalidate_page(mapping, page);
|
2011-05-27 00:01:43 +08:00
|
|
|
|
2016-07-27 06:26:04 +08:00
|
|
|
VM_BUG_ON_PAGE(PageTail(page), page);
|
mm: __delete_from_page_cache show Bad page if mapped
Commit e1534ae95004 ("mm: differentiate page_mapped() from
page_mapcount() for compound pages") changed the famous
BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
CONFIG_DEBUG_VM=y, but nothing at all when not.
Although it has not usually been very helpul, being hit long after the
error in question, we do need to know if it actually happens on users'
systems; but reinstating a crash there is likely to be opposed :)
In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
but that seems to be the standard procedure now. Move that, or the
VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
unNULLified page->mapping gives a little more information.
If the inode is being evicted (rather than truncated), it won't have any
vmas left, so it's safe(ish) to assume that the raised mapcount is
erroneous, and we can discount it from page_count to avoid leaking the
page (I'm less worried by leaking the occasional 4kB, than losing a
potential 2MB page with each 4kB page leaked).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-10 06:08:07 +08:00
|
|
|
VM_BUG_ON_PAGE(page_mapped(page), page);
|
|
|
|
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
|
|
|
|
int mapcount;
|
|
|
|
|
|
|
|
pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n",
|
|
|
|
current->comm, page_to_pfn(page));
|
|
|
|
dump_page(page, "still mapped when deleted");
|
|
|
|
dump_stack();
|
|
|
|
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
|
|
|
|
|
|
|
|
mapcount = page_mapcount(page);
|
|
|
|
if (mapping_exiting(mapping) &&
|
|
|
|
page_count(page) >= mapcount + 2) {
|
|
|
|
/*
|
|
|
|
* All vmas have already been torn down, so it's
|
|
|
|
* a good bet that actually the page is unmapped,
|
|
|
|
* and we'd prefer not to leak it: if we're wrong,
|
|
|
|
* some other bad page check should catch it later.
|
|
|
|
*/
|
|
|
|
page_mapcount_reset(page);
|
2016-05-20 08:10:46 +08:00
|
|
|
page_ref_sub(page, mapcount);
|
mm: __delete_from_page_cache show Bad page if mapped
Commit e1534ae95004 ("mm: differentiate page_mapped() from
page_mapcount() for compound pages") changed the famous
BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
CONFIG_DEBUG_VM=y, but nothing at all when not.
Although it has not usually been very helpul, being hit long after the
error in question, we do need to know if it actually happens on users'
systems; but reinstating a crash there is likely to be opposed :)
In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
but that seems to be the standard procedure now. Move that, or the
VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
unNULLified page->mapping gives a little more information.
If the inode is being evicted (rather than truncated), it won't have any
vmas left, so it's safe(ish) to assume that the raised mapcount is
erroneous, and we can discount it from page_count to avoid leaking the
page (I'm less worried by leaking the occasional 4kB, than losing a
potential 2MB page with each 4kB page leaked).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-10 06:08:07 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-25 07:57:24 +08:00
|
|
|
/* hugetlb pages do not participate in page cache accounting. */
|
2017-11-16 09:37:29 +08:00
|
|
|
if (PageHuge(page))
|
|
|
|
return;
|
2017-07-11 06:47:35 +08:00
|
|
|
|
2017-11-16 09:37:29 +08:00
|
|
|
nr = hpage_nr_pages(page);
|
|
|
|
|
2020-06-04 07:01:54 +08:00
|
|
|
__mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
|
2017-11-16 09:37:29 +08:00
|
|
|
if (PageSwapBacked(page)) {
|
2020-06-04 07:01:54 +08:00
|
|
|
__mod_lruvec_page_state(page, NR_SHMEM, -nr);
|
2017-11-16 09:37:29 +08:00
|
|
|
if (PageTransHuge(page))
|
|
|
|
__dec_node_page_state(page, NR_SHMEM_THPS);
|
2019-09-24 06:38:00 +08:00
|
|
|
} else if (PageTransHuge(page)) {
|
|
|
|
__dec_node_page_state(page, NR_FILE_THPS);
|
2019-09-24 06:38:03 +08:00
|
|
|
filemap_nr_thps_dec(mapping);
|
2016-07-27 06:26:18 +08:00
|
|
|
}
|
2017-11-16 09:37:29 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* At this point page must be either written or cleaned by
|
|
|
|
* truncate. Dirty page here signals a bug and loss of
|
|
|
|
* unwritten data.
|
|
|
|
*
|
|
|
|
* This fixes dirty accounting after removing the page entirely
|
|
|
|
* but leaves PageDirty set: it has no effect for truncated
|
|
|
|
* page and anyway will be cleared before returning page into
|
|
|
|
* buddy allocator.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(PageDirty(page)))
|
|
|
|
account_page_cleaned(page, mapping, inode_to_wb(mapping->host));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Delete a page from the page cache and free it. Caller has to make
|
|
|
|
* sure the page is locked and that nobody else uses it - or that usage
|
2018-04-11 07:36:56 +08:00
|
|
|
* is safe. The caller must hold the i_pages lock.
|
2017-11-16 09:37:29 +08:00
|
|
|
*/
|
|
|
|
void __delete_from_page_cache(struct page *page, void *shadow)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = page->mapping;
|
|
|
|
|
|
|
|
trace_mm_filemap_delete_from_page_cache(page);
|
|
|
|
|
|
|
|
unaccount_page_cache_page(mapping, page);
|
2017-11-21 22:17:59 +08:00
|
|
|
page_cache_delete(mapping, page, shadow);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2017-11-16 09:37:18 +08:00
|
|
|
static void page_cache_free_page(struct address_space *mapping,
|
|
|
|
struct page *page)
|
|
|
|
{
|
|
|
|
void (*freepage)(struct page *);
|
|
|
|
|
|
|
|
freepage = mapping->a_ops->freepage;
|
|
|
|
if (freepage)
|
|
|
|
freepage(page);
|
|
|
|
|
|
|
|
if (PageTransHuge(page) && !PageHuge(page)) {
|
|
|
|
page_ref_sub(page, HPAGE_PMD_NR);
|
|
|
|
VM_BUG_ON_PAGE(page_count(page) <= 0, page);
|
|
|
|
} else {
|
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-03-23 07:32:43 +08:00
|
|
|
/**
|
|
|
|
* delete_from_page_cache - delete page from page cache
|
|
|
|
* @page: the page which the kernel is trying to remove from page cache
|
|
|
|
*
|
|
|
|
* This must be called only on pages that have been verified to be in the page
|
|
|
|
* cache and locked. It will never put the page into the free list, the caller
|
|
|
|
* has a reference on the page.
|
|
|
|
*/
|
|
|
|
void delete_from_page_cache(struct page *page)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2016-07-27 06:26:04 +08:00
|
|
|
struct address_space *mapping = page_mapping(page);
|
memcg: add per cgroup dirty page accounting
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).
The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
memcg = mem_cgroup_begin_page_stat(page)
if (TestSetPageDirty()) {
[...]
mem_cgroup_update_page_stat(memcg)
}
mem_cgroup_end_page_stat(memcg)
Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
rcu_read_lock() + spin_lock_irqsave()
A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by for mem_cgroup_update_page_stat().
Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
__mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
__delete_from_page_cache(), replace_page_cache_page(),
invalidate_complete_page2(), and __remove_mapping().
text data bss dec hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
+192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
+773 text bytes
Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
all metrics, they're all wall clock or cycle counts. The read and write
fault benchmarks just measure fault time, they do not include I/O time.
* CONFIG_MEMCG not set:
baseline patched
kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)
* CONFIG_MEMCG=y root_memcg:
baseline patched
kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)
* CONFIG_MEMCG=y non-root_memcg:
baseline patched
kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)
As expected anon page faults are not affected by this patch.
tj: Updated to apply on top of the recent cancel_dirty_page() changes.
Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-23 05:13:16 +08:00
|
|
|
unsigned long flags;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2005-05-01 23:59:01 +08:00
|
|
|
BUG_ON(!PageLocked(page));
|
2018-04-11 07:36:56 +08:00
|
|
|
xa_lock_irqsave(&mapping->i_pages, flags);
|
2016-03-16 05:57:22 +08:00
|
|
|
__delete_from_page_cache(page, NULL);
|
2018-04-11 07:36:56 +08:00
|
|
|
xa_unlock_irqrestore(&mapping->i_pages, flags);
|
2010-12-02 02:35:19 +08:00
|
|
|
|
2017-11-16 09:37:18 +08:00
|
|
|
page_cache_free_page(mapping, page);
|
2011-03-23 07:30:53 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(delete_from_page_cache);
|
|
|
|
|
2017-11-16 09:37:33 +08:00
|
|
|
/*
|
2017-12-04 16:59:45 +08:00
|
|
|
* page_cache_delete_batch - delete several pages from page cache
|
2017-11-16 09:37:33 +08:00
|
|
|
* @mapping: the mapping to which pages belong
|
|
|
|
* @pvec: pagevec with pages to delete
|
|
|
|
*
|
2018-04-11 07:36:56 +08:00
|
|
|
* The function walks over mapping->i_pages and removes pages passed in @pvec
|
2019-09-24 06:34:52 +08:00
|
|
|
* from the mapping. The function expects @pvec to be sorted by page index
|
|
|
|
* and is optimised for it to be dense.
|
2018-04-11 07:36:56 +08:00
|
|
|
* It tolerates holes in @pvec (mapping entries at those indices are not
|
2017-11-16 09:37:33 +08:00
|
|
|
* modified). The function expects only THP head pages to be present in the
|
2019-09-24 06:34:52 +08:00
|
|
|
* @pvec.
|
2017-11-16 09:37:33 +08:00
|
|
|
*
|
2018-04-11 07:36:56 +08:00
|
|
|
* The function expects the i_pages lock to be held.
|
2017-11-16 09:37:33 +08:00
|
|
|
*/
|
2017-12-04 16:59:45 +08:00
|
|
|
static void page_cache_delete_batch(struct address_space *mapping,
|
2017-11-16 09:37:33 +08:00
|
|
|
struct pagevec *pvec)
|
|
|
|
{
|
2017-12-04 16:59:45 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
|
2017-11-16 09:37:33 +08:00
|
|
|
int total_pages = 0;
|
2019-09-24 06:34:52 +08:00
|
|
|
int i = 0;
|
2017-11-16 09:37:33 +08:00
|
|
|
struct page *page;
|
|
|
|
|
2017-12-04 16:59:45 +08:00
|
|
|
mapping_set_update(&xas, mapping);
|
|
|
|
xas_for_each(&xas, page, ULONG_MAX) {
|
2019-09-24 06:34:52 +08:00
|
|
|
if (i >= pagevec_count(pvec))
|
2017-11-16 09:37:33 +08:00
|
|
|
break;
|
2019-09-24 06:34:52 +08:00
|
|
|
|
|
|
|
/* A swap/dax/shadow entry got inserted? Skip it. */
|
2017-11-04 01:30:42 +08:00
|
|
|
if (xa_is_value(page))
|
2017-11-16 09:37:33 +08:00
|
|
|
continue;
|
2019-09-24 06:34:52 +08:00
|
|
|
/*
|
|
|
|
* A page got inserted in our range? Skip it. We have our
|
|
|
|
* pages locked so they are protected from being removed.
|
|
|
|
* If we see a page whose index is higher than ours, it
|
|
|
|
* means our page has been removed, which shouldn't be
|
|
|
|
* possible because we're holding the PageLock.
|
|
|
|
*/
|
|
|
|
if (page != pvec->pages[i]) {
|
|
|
|
VM_BUG_ON_PAGE(page->index > pvec->pages[i]->index,
|
|
|
|
page);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!PageLocked(page));
|
|
|
|
|
|
|
|
if (page->index == xas.xa_index)
|
2017-11-16 09:37:33 +08:00
|
|
|
page->mapping = NULL;
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Leave page->index set: truncation lookup relies on it */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move to the next page in the vector if this is a regular
|
|
|
|
* page or the index is of the last sub-page of this compound
|
|
|
|
* page.
|
|
|
|
*/
|
|
|
|
if (page->index + compound_nr(page) - 1 == xas.xa_index)
|
2017-11-16 09:37:33 +08:00
|
|
|
i++;
|
2017-12-04 16:59:45 +08:00
|
|
|
xas_store(&xas, NULL);
|
2017-11-16 09:37:33 +08:00
|
|
|
total_pages++;
|
|
|
|
}
|
|
|
|
mapping->nrpages -= total_pages;
|
|
|
|
}
|
|
|
|
|
|
|
|
void delete_from_page_cache_batch(struct address_space *mapping,
|
|
|
|
struct pagevec *pvec)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
if (!pagevec_count(pvec))
|
|
|
|
return;
|
|
|
|
|
2018-04-11 07:36:56 +08:00
|
|
|
xa_lock_irqsave(&mapping->i_pages, flags);
|
2017-11-16 09:37:33 +08:00
|
|
|
for (i = 0; i < pagevec_count(pvec); i++) {
|
|
|
|
trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);
|
|
|
|
|
|
|
|
unaccount_page_cache_page(mapping, pvec->pages[i]);
|
|
|
|
}
|
2017-12-04 16:59:45 +08:00
|
|
|
page_cache_delete_batch(mapping, pvec);
|
2018-04-11 07:36:56 +08:00
|
|
|
xa_unlock_irqrestore(&mapping->i_pages, flags);
|
2017-11-16 09:37:33 +08:00
|
|
|
|
|
|
|
for (i = 0; i < pagevec_count(pvec); i++)
|
|
|
|
page_cache_free_page(mapping, pvec->pages[i]);
|
|
|
|
}
|
|
|
|
|
2016-07-29 20:10:57 +08:00
|
|
|
int filemap_check_errors(struct address_space *mapping)
|
2013-04-30 06:08:42 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
/* Check for outstanding write errors */
|
2014-05-23 02:54:16 +08:00
|
|
|
if (test_bit(AS_ENOSPC, &mapping->flags) &&
|
|
|
|
test_and_clear_bit(AS_ENOSPC, &mapping->flags))
|
2013-04-30 06:08:42 +08:00
|
|
|
ret = -ENOSPC;
|
2014-05-23 02:54:16 +08:00
|
|
|
if (test_bit(AS_EIO, &mapping->flags) &&
|
|
|
|
test_and_clear_bit(AS_EIO, &mapping->flags))
|
2013-04-30 06:08:42 +08:00
|
|
|
ret = -EIO;
|
|
|
|
return ret;
|
|
|
|
}
|
2016-07-29 20:10:57 +08:00
|
|
|
EXPORT_SYMBOL(filemap_check_errors);
|
2013-04-30 06:08:42 +08:00
|
|
|
|
2017-07-06 19:02:22 +08:00
|
|
|
static int filemap_check_and_keep_errors(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
/* Check for outstanding write errors */
|
|
|
|
if (test_bit(AS_EIO, &mapping->flags))
|
|
|
|
return -EIO;
|
|
|
|
if (test_bit(AS_ENOSPC, &mapping->flags))
|
|
|
|
return -ENOSPC;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2006-06-23 17:03:49 +08:00
|
|
|
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
|
2005-05-01 23:59:26 +08:00
|
|
|
* @mapping: address space structure to write
|
|
|
|
* @start: offset in bytes where the range starts
|
2006-03-24 19:17:45 +08:00
|
|
|
* @end: offset in bytes where the range ends (inclusive)
|
2005-05-01 23:59:26 +08:00
|
|
|
* @sync_mode: enable synchronous operation
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2006-06-23 17:03:49 +08:00
|
|
|
* Start writeback against all of a mapping's dirty pages that lie
|
|
|
|
* within the byte offsets <start, end> inclusive.
|
|
|
|
*
|
2005-04-17 06:20:36 +08:00
|
|
|
* If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
|
2006-06-23 17:03:49 +08:00
|
|
|
* opposed to a regular memory cleansing writeback. The difference between
|
2005-04-17 06:20:36 +08:00
|
|
|
* these two operations is that if a dirty page/buffer is encountered, it must
|
|
|
|
* be waited upon, and not just skipped over.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
[PATCH] fadvise(): write commands
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent that last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".
But these commands are a perfect fit with the fadvise() impementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:18:04 +08:00
|
|
|
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
|
|
|
|
loff_t end, int sync_mode)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = sync_mode,
|
mm: write_cache_pages integrity fix
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
so the function will return success after writing out nr_to_write pages,
even if that was not sufficient to guarantee data integrity.
The callers tend to set it to values that could break data interity
semantics easily in practice. For example, nr_to_write can be set to
mapping->nr_pages * 2, however if a file has a single, dirty page, then
fsync is called, subsequent pages might be concurrently added and dirtied,
then write_cache_pages might writeout two of these newly dirty pages,
while not writing out the old page that should have been written out.
Fix this by ignoring nr_to_write if it is a data integrity sync.
This is a data integrity bug.
The reason this has been done in the past is to avoid stalling sync
operations behind page dirtiers.
"If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000.
Somehow."
What we do today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for. Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.
This change does indeed make the possibility of long stalls lager, and
that's not a good thing, but lying about data integrity is even worse. We
have to either perform the sync, or return -ELINUXISLAME so at least the
caller knows what has happened.
There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-07 06:39:08 +08:00
|
|
|
.nr_to_write = LONG_MAX,
|
[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:03:26 +08:00
|
|
|
.range_start = start,
|
|
|
|
.range_end = end,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2019-09-24 06:34:45 +08:00
|
|
|
if (!mapping_cap_writeback_dirty(mapping) ||
|
|
|
|
!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
|
|
|
|
2015-06-02 22:39:48 +08:00
|
|
|
wbc_attach_fdatawrite_inode(&wbc, mapping->host);
|
2005-04-17 06:20:36 +08:00
|
|
|
ret = do_writepages(mapping, &wbc);
|
2015-06-02 22:39:48 +08:00
|
|
|
wbc_detach_inode(&wbc);
|
2005-04-17 06:20:36 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int __filemap_fdatawrite(struct address_space *mapping,
|
|
|
|
int sync_mode)
|
|
|
|
{
|
[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:03:26 +08:00
|
|
|
return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
int filemap_fdatawrite(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fdatawrite);
|
|
|
|
|
2008-07-12 07:27:31 +08:00
|
|
|
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
|
[PATCH] fadvise(): write commands
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent that last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".
But these commands are a perfect fit with the fadvise() impementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 19:18:04 +08:00
|
|
|
loff_t end)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
|
|
|
|
}
|
2008-07-12 07:27:31 +08:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawrite_range);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
|
|
|
* filemap_flush - mostly a non-blocking flush
|
|
|
|
* @mapping: target address_space
|
|
|
|
*
|
2005-04-17 06:20:36 +08:00
|
|
|
* This is a mostly non-blocking flush. Not suitable for data-integrity
|
|
|
|
* purposes - I/O may not be started against all dirty pages.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
int filemap_flush(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_flush);
|
|
|
|
|
2017-06-20 20:05:41 +08:00
|
|
|
/**
|
|
|
|
* filemap_range_has_page - check if a page exists in range.
|
|
|
|
* @mapping: address space within which to check
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Find at least one page in the range supplied, usually used to check if
|
|
|
|
* direct writing in this range will trigger a writeback.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %true if at least one page exists in the specified range,
|
|
|
|
* %false otherwise.
|
2017-06-20 20:05:41 +08:00
|
|
|
*/
|
|
|
|
bool filemap_range_has_page(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
2017-09-07 07:21:40 +08:00
|
|
|
struct page *page;
|
2018-01-16 19:26:49 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
|
|
|
|
pgoff_t max = end_byte >> PAGE_SHIFT;
|
2017-06-20 20:05:41 +08:00
|
|
|
|
|
|
|
if (end_byte < start_byte)
|
|
|
|
return false;
|
|
|
|
|
2018-01-16 19:26:49 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
for (;;) {
|
|
|
|
page = xas_find(&xas, max);
|
|
|
|
if (xas_retry(&xas, page))
|
|
|
|
continue;
|
|
|
|
/* Shadow entries don't count */
|
|
|
|
if (xa_is_value(page))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* We don't need to try to pin this page; we're about to
|
|
|
|
* release the RCU lock anyway. It is enough to know that
|
|
|
|
* there was a page here recently.
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2017-06-20 20:05:41 +08:00
|
|
|
|
2018-01-16 19:26:49 +08:00
|
|
|
return page != NULL;
|
2017-06-20 20:05:41 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_range_has_page);
|
|
|
|
|
2017-07-06 19:02:24 +08:00
|
|
|
static void __filemap_fdatawait_range(struct address_space *mapping,
|
2015-11-06 10:47:23 +08:00
|
|
|
loff_t start_byte, loff_t end_byte)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
pgoff_t index = start_byte >> PAGE_SHIFT;
|
|
|
|
pgoff_t end = end_byte >> PAGE_SHIFT;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct pagevec pvec;
|
|
|
|
int nr_pages;
|
|
|
|
|
2009-10-01 04:16:33 +08:00
|
|
|
if (end_byte < start_byte)
|
2017-07-06 19:02:24 +08:00
|
|
|
return;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-11-16 09:37:52 +08:00
|
|
|
pagevec_init(&pvec);
|
2017-11-16 09:35:05 +08:00
|
|
|
while (index <= end) {
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned i;
|
|
|
|
|
2017-11-16 09:35:05 +08:00
|
|
|
nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index,
|
2017-11-16 09:35:19 +08:00
|
|
|
end, PAGECACHE_TAG_WRITEBACK);
|
2017-11-16 09:35:05 +08:00
|
|
|
if (!nr_pages)
|
|
|
|
break;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
struct page *page = pvec.pages[i];
|
|
|
|
|
|
|
|
wait_on_page_writeback(page);
|
2017-07-06 19:02:24 +08:00
|
|
|
ClearPageError(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
pagevec_release(&pvec);
|
|
|
|
cond_resched();
|
|
|
|
}
|
2015-11-06 10:47:23 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* filemap_fdatawait_range - wait for writeback to complete
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space
|
|
|
|
* in the given range and wait for all of them. Check error status of
|
|
|
|
* the address space and return it.
|
|
|
|
*
|
|
|
|
* Since the error status of the address space is cleared by this function,
|
|
|
|
* callers are responsible for checking the return value and handling and/or
|
|
|
|
* reporting the error.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space.
|
2015-11-06 10:47:23 +08:00
|
|
|
*/
|
|
|
|
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
|
|
|
|
loff_t end_byte)
|
|
|
|
{
|
2017-07-06 19:02:24 +08:00
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return filemap_check_errors(mapping);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2009-08-18 01:30:27 +08:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_range);
|
|
|
|
|
2019-06-21 05:05:37 +08:00
|
|
|
/**
|
|
|
|
* filemap_fdatawait_range_keep_errors - wait for writeback to complete
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space in the
|
|
|
|
* given range and wait for all of them. Unlike filemap_fdatawait_range(),
|
|
|
|
* this function does not clear error status of the address space.
|
|
|
|
*
|
|
|
|
* Use this function if callers don't handle errors themselves. Expected
|
|
|
|
* call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
|
|
|
|
* fsfreeze(8)
|
|
|
|
*/
|
|
|
|
int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return filemap_check_and_keep_errors(mapping);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
|
|
|
|
|
2017-07-28 19:24:43 +08:00
|
|
|
/**
|
|
|
|
* file_fdatawait_range - wait for writeback to complete
|
|
|
|
* @file: file pointing to address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the address space that file
|
|
|
|
* refers to, in the given range and wait for all of them. Check error
|
|
|
|
* status of the address space vs. the file->f_wb_err cursor and return it.
|
|
|
|
*
|
|
|
|
* Since the error status of the file is advanced by this function,
|
|
|
|
* callers are responsible for checking the return value and handling and/or
|
|
|
|
* reporting the error.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space vs. the file->f_wb_err cursor.
|
2017-07-28 19:24:43 +08:00
|
|
|
*/
|
|
|
|
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return file_check_and_advance_wb_err(file);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(file_fdatawait_range);
|
2009-08-18 01:30:27 +08:00
|
|
|
|
2015-11-06 10:47:23 +08:00
|
|
|
/**
|
|
|
|
* filemap_fdatawait_keep_errors - wait for writeback without clearing errors
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space
|
|
|
|
* and wait for all of them. Unlike filemap_fdatawait(), this function
|
|
|
|
* does not clear error status of the address space.
|
|
|
|
*
|
|
|
|
* Use this function if callers don't handle errors themselves. Expected
|
|
|
|
* call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
|
|
|
|
* fsfreeze(8)
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space.
|
2015-11-06 10:47:23 +08:00
|
|
|
*/
|
2017-07-06 19:02:22 +08:00
|
|
|
int filemap_fdatawait_keep_errors(struct address_space *mapping)
|
2015-11-06 10:47:23 +08:00
|
|
|
{
|
2017-07-31 22:29:38 +08:00
|
|
|
__filemap_fdatawait_range(mapping, 0, LLONG_MAX);
|
2017-07-06 19:02:22 +08:00
|
|
|
return filemap_check_and_keep_errors(mapping);
|
2015-11-06 10:47:23 +08:00
|
|
|
}
|
2017-07-06 19:02:22 +08:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
|
2015-11-06 10:47:23 +08:00
|
|
|
|
2019-09-24 06:34:48 +08:00
|
|
|
/* Returns true if writeback might be needed or already in progress. */
|
2017-07-26 22:21:11 +08:00
|
|
|
static bool mapping_needs_writeback(struct address_space *mapping)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-09-24 06:34:48 +08:00
|
|
|
if (dax_mapping(mapping))
|
|
|
|
return mapping->nrexceptional;
|
|
|
|
|
|
|
|
return mapping->nrpages;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
|
|
|
* filemap_write_and_wait_range - write out & wait on a file range
|
|
|
|
* @mapping: the address_space for the pages
|
|
|
|
* @lstart: offset in bytes where the range starts
|
|
|
|
* @lend: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
2006-03-24 19:17:45 +08:00
|
|
|
* Write out and wait upon file offsets lstart->lend, inclusive.
|
|
|
|
*
|
2017-03-31 04:11:36 +08:00
|
|
|
* Note that @lend is inclusive (describes the last byte to be written) so
|
2006-03-24 19:17:45 +08:00
|
|
|
* that this function can be used to write to the very end-of-file (end = -1).
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space.
|
2006-03-24 19:17:45 +08:00
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
int filemap_write_and_wait_range(struct address_space *mapping,
|
|
|
|
loff_t lstart, loff_t lend)
|
|
|
|
{
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
int err = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-07-26 22:21:11 +08:00
|
|
|
if (mapping_needs_writeback(mapping)) {
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
err = __filemap_fdatawrite_range(mapping, lstart, lend,
|
|
|
|
WB_SYNC_ALL);
|
2020-01-31 14:12:07 +08:00
|
|
|
/*
|
|
|
|
* Even if the above returned error, the pages may be
|
|
|
|
* written partially (e.g. -ENOSPC), so we wait for it.
|
|
|
|
* But the -EIO is special case, it may indicate the worst
|
|
|
|
* thing (e.g. bug) happened, so we avoid waiting for it.
|
|
|
|
*/
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
if (err != -EIO) {
|
2009-10-01 04:16:33 +08:00
|
|
|
int err2 = filemap_fdatawait_range(mapping,
|
|
|
|
lstart, lend);
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
if (!err)
|
|
|
|
err = err2;
|
2017-07-06 19:02:23 +08:00
|
|
|
} else {
|
|
|
|
/* Clear any previously stored errors */
|
|
|
|
filemap_check_errors(mapping);
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
}
|
2013-04-30 06:08:42 +08:00
|
|
|
} else {
|
|
|
|
err = filemap_check_errors(mapping);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 17:02:14 +08:00
|
|
|
return err;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2009-04-16 01:22:37 +08:00
|
|
|
EXPORT_SYMBOL(filemap_write_and_wait_range);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
void __filemap_set_wb_err(struct address_space *mapping, int err)
|
|
|
|
{
|
2017-07-24 18:22:15 +08:00
|
|
|
errseq_t eseq = errseq_set(&mapping->wb_err, err);
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
|
|
|
|
trace_filemap_set_wb_err(mapping, eseq);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__filemap_set_wb_err);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* file_check_and_advance_wb_err - report wb error (if any) that was previously
|
|
|
|
* and advance wb_err to current one
|
|
|
|
* @file: struct file on which the error is being reported
|
|
|
|
*
|
|
|
|
* When userland calls fsync (or something like nfsd does the equivalent), we
|
|
|
|
* want to report any writeback errors that occurred since the last fsync (or
|
|
|
|
* since the file was opened if there haven't been any).
|
|
|
|
*
|
|
|
|
* Grab the wb_err from the mapping. If it matches what we have in the file,
|
|
|
|
* then just quickly return 0. The file is all caught up.
|
|
|
|
*
|
|
|
|
* If it doesn't match, then take the mapping value, set the "seen" flag in
|
|
|
|
* it and try to swap it into place. If it works, or another task beat us
|
|
|
|
* to it with the new value, then update the f_wb_err and return the error
|
|
|
|
* portion. The error at this point must be reported via proper channels
|
|
|
|
* (a'la fsync, or NFS COMMIT operation, etc.).
|
|
|
|
*
|
|
|
|
* While we handle mapping->wb_err with atomic operations, the f_wb_err
|
|
|
|
* value is protected by the f_lock since we must ensure that it reflects
|
|
|
|
* the latest value swapped in for this file descriptor.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
*/
|
|
|
|
int file_check_and_advance_wb_err(struct file *file)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
errseq_t old = READ_ONCE(file->f_wb_err);
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
|
|
|
/* Locklessly handle the common case where nothing has changed */
|
|
|
|
if (errseq_check(&mapping->wb_err, old)) {
|
|
|
|
/* Something changed, must use slow path */
|
|
|
|
spin_lock(&file->f_lock);
|
|
|
|
old = file->f_wb_err;
|
|
|
|
err = errseq_check_and_advance(&mapping->wb_err,
|
|
|
|
&file->f_wb_err);
|
|
|
|
trace_file_check_and_advance_wb_err(file, old);
|
|
|
|
spin_unlock(&file->f_lock);
|
|
|
|
}
|
2017-10-04 07:15:25 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We're mostly using this function as a drop in replacement for
|
|
|
|
* filemap_check_errors. Clear AS_EIO/AS_ENOSPC to emulate the effect
|
|
|
|
* that the legacy code would have had on these flags.
|
|
|
|
*/
|
|
|
|
clear_bit(AS_EIO, &mapping->flags);
|
|
|
|
clear_bit(AS_ENOSPC, &mapping->flags);
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(file_check_and_advance_wb_err);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* file_write_and_wait_range - write out & wait on a file range
|
|
|
|
* @file: file pointing to address_space with pages
|
|
|
|
* @lstart: offset in bytes where the range starts
|
|
|
|
* @lend: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Write out and wait upon file offsets lstart->lend, inclusive.
|
|
|
|
*
|
|
|
|
* Note that @lend is inclusive (describes the last byte to be written) so
|
|
|
|
* that this function can be used to write to the very end-of-file (end = -1).
|
|
|
|
*
|
|
|
|
* After writing out and waiting on the data, we check and advance the
|
|
|
|
* f_wb_err cursor to the latest value, and return any errors detected there.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
*/
|
|
|
|
int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
|
|
|
|
{
|
|
|
|
int err = 0, err2;
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
2017-07-26 22:21:11 +08:00
|
|
|
if (mapping_needs_writeback(mapping)) {
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 19:02:25 +08:00
|
|
|
err = __filemap_fdatawrite_range(mapping, lstart, lend,
|
|
|
|
WB_SYNC_ALL);
|
|
|
|
/* See comment of filemap_write_and_wait() */
|
|
|
|
if (err != -EIO)
|
|
|
|
__filemap_fdatawait_range(mapping, lstart, lend);
|
|
|
|
}
|
|
|
|
err2 = file_check_and_advance_wb_err(file);
|
|
|
|
if (!err)
|
|
|
|
err = err2;
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(file_write_and_wait_range);
|
|
|
|
|
2011-03-23 07:30:52 +08:00
|
|
|
/**
|
|
|
|
* replace_page_cache_page - replace a pagecache page with a new one
|
|
|
|
* @old: page to be replaced
|
|
|
|
* @new: page to replace with
|
|
|
|
* @gfp_mask: allocation mode
|
|
|
|
*
|
|
|
|
* This function replaces a page in the pagecache with a new one. On
|
|
|
|
* success it acquires the pagecache reference for the new page and
|
|
|
|
* drops it for the old page. Both the old and new pages must be
|
|
|
|
* locked. This function does not add the new page to the LRU, the
|
|
|
|
* caller must do that.
|
|
|
|
*
|
2017-11-17 23:01:45 +08:00
|
|
|
* The remove + add is atomic. This function cannot fail.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0
|
2011-03-23 07:30:52 +08:00
|
|
|
*/
|
|
|
|
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
|
|
|
|
{
|
2017-11-17 23:01:45 +08:00
|
|
|
struct address_space *mapping = old->mapping;
|
|
|
|
void (*freepage)(struct page *) = mapping->a_ops->freepage;
|
|
|
|
pgoff_t offset = old->index;
|
|
|
|
XA_STATE(xas, &mapping->i_pages, offset);
|
|
|
|
unsigned long flags;
|
2011-03-23 07:30:52 +08:00
|
|
|
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(old), old);
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(new), new);
|
|
|
|
VM_BUG_ON_PAGE(new->mapping, new);
|
2011-03-23 07:30:52 +08:00
|
|
|
|
2017-11-17 23:01:45 +08:00
|
|
|
get_page(new);
|
|
|
|
new->mapping = mapping;
|
|
|
|
new->index = offset;
|
2011-03-23 07:30:52 +08:00
|
|
|
|
2020-06-04 07:01:54 +08:00
|
|
|
mem_cgroup_migrate(old, new);
|
|
|
|
|
2017-11-17 23:01:45 +08:00
|
|
|
xas_lock_irqsave(&xas, flags);
|
|
|
|
xas_store(&xas, new);
|
2015-06-25 07:57:24 +08:00
|
|
|
|
2017-11-17 23:01:45 +08:00
|
|
|
old->mapping = NULL;
|
|
|
|
/* hugetlb pages do not participate in page cache accounting. */
|
|
|
|
if (!PageHuge(old))
|
2020-06-04 07:01:54 +08:00
|
|
|
__dec_lruvec_page_state(old, NR_FILE_PAGES);
|
2017-11-17 23:01:45 +08:00
|
|
|
if (!PageHuge(new))
|
2020-06-04 07:01:54 +08:00
|
|
|
__inc_lruvec_page_state(new, NR_FILE_PAGES);
|
2017-11-17 23:01:45 +08:00
|
|
|
if (PageSwapBacked(old))
|
2020-06-04 07:01:54 +08:00
|
|
|
__dec_lruvec_page_state(old, NR_SHMEM);
|
2017-11-17 23:01:45 +08:00
|
|
|
if (PageSwapBacked(new))
|
2020-06-04 07:01:54 +08:00
|
|
|
__inc_lruvec_page_state(new, NR_SHMEM);
|
2017-11-17 23:01:45 +08:00
|
|
|
xas_unlock_irqrestore(&xas, flags);
|
|
|
|
if (freepage)
|
|
|
|
freepage(old);
|
|
|
|
put_page(old);
|
2011-03-23 07:30:52 +08:00
|
|
|
|
2017-11-17 23:01:45 +08:00
|
|
|
return 0;
|
2011-03-23 07:30:52 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(replace_page_cache_page);
|
|
|
|
|
2014-04-04 05:47:51 +08:00
|
|
|
static int __add_to_page_cache_locked(struct page *page,
|
|
|
|
struct address_space *mapping,
|
|
|
|
pgoff_t offset, gfp_t gfp_mask,
|
|
|
|
void **shadowp)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-11-17 23:01:45 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, offset);
|
mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:20 +08:00
|
|
|
int huge = PageHuge(page);
|
2008-07-26 10:45:30 +08:00
|
|
|
int error;
|
2017-11-17 23:01:45 +08:00
|
|
|
void *old;
|
2008-07-26 10:45:30 +08:00
|
|
|
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
|
2017-11-17 23:01:45 +08:00
|
|
|
mapping_set_update(&xas, mapping);
|
2008-07-26 10:45:30 +08:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
get_page(page);
|
2013-09-13 06:13:59 +08:00
|
|
|
page->mapping = mapping;
|
|
|
|
page->index = offset;
|
|
|
|
|
mm: memcontrol: convert page cache to a new mem_cgroup_charge() API
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:01:41 +08:00
|
|
|
if (!huge) {
|
2020-06-04 07:02:24 +08:00
|
|
|
error = mem_cgroup_charge(page, current->mm, gfp_mask);
|
mm: memcontrol: convert page cache to a new mem_cgroup_charge() API
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:01:41 +08:00
|
|
|
if (error)
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
2017-11-17 23:01:45 +08:00
|
|
|
do {
|
|
|
|
xas_lock_irq(&xas);
|
|
|
|
old = xas_load(&xas);
|
|
|
|
if (old && !xa_is_value(old))
|
|
|
|
xas_set_err(&xas, -EEXIST);
|
|
|
|
xas_store(&xas, page);
|
|
|
|
if (xas_error(&xas))
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
if (xa_is_value(old)) {
|
|
|
|
mapping->nrexceptional--;
|
|
|
|
if (shadowp)
|
|
|
|
*shadowp = old;
|
|
|
|
}
|
|
|
|
mapping->nrpages++;
|
|
|
|
|
|
|
|
/* hugetlb pages do not participate in page cache accounting */
|
|
|
|
if (!huge)
|
2020-06-04 07:01:54 +08:00
|
|
|
__inc_lruvec_page_state(page, NR_FILE_PAGES);
|
2017-11-17 23:01:45 +08:00
|
|
|
unlock:
|
|
|
|
xas_unlock_irq(&xas);
|
|
|
|
} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
|
|
|
|
|
mm: memcontrol: convert page cache to a new mem_cgroup_charge() API
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:01:41 +08:00
|
|
|
if (xas_error(&xas)) {
|
|
|
|
error = xas_error(&xas);
|
2017-11-17 23:01:45 +08:00
|
|
|
goto error;
|
mm: memcontrol: convert page cache to a new mem_cgroup_charge() API
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:01:41 +08:00
|
|
|
}
|
2015-06-25 07:57:24 +08:00
|
|
|
|
2013-09-13 06:13:59 +08:00
|
|
|
trace_mm_filemap_add_to_page_cache(page);
|
|
|
|
return 0;
|
2017-11-17 23:01:45 +08:00
|
|
|
error:
|
2013-09-13 06:13:59 +08:00
|
|
|
page->mapping = NULL;
|
|
|
|
/* Leave page->index set: truncation relies upon it */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
mm: memcontrol: convert page cache to a new mem_cgroup_charge() API
The try/commit/cancel protocol that memcg uses dates back to when pages
used to be uncharged upon removal from the page cache, and thus couldn't
be committed before the insertion had succeeded. Nowadays, pages are
uncharged when they are physically freed; it doesn't matter whether the
insertion was successful or not. For the page cache, the transaction
dance has become unnecessary.
Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything but
free/put the page.
Then switch the page cache over to this new API.
Subsequent patches will also convert anon pages, but it needs a bit more
prep work. Right now, memcg depends on page->mapping being already set up
at the time of charging, so that it can maintain its own MEMCG_CACHE and
MEMCG_RSS counters. For anon, page->mapping is set under the same pte
lock under which the page is publishd, so a single charge point that can
block doesn't work there just yet.
The following prep patches will replace the private memcg counters with
the generic vmstat counters, thus removing the page->mapping dependency,
then complete the transition to the new single-point charge API and delete
the old transactional scheme.
v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceeding shmem simplification patch
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 07:01:41 +08:00
|
|
|
return error;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2019-05-14 08:21:04 +08:00
|
|
|
ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
|
2014-04-04 05:47:51 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* add_to_page_cache_locked - add a locked page to the pagecache
|
|
|
|
* @page: page to add
|
|
|
|
* @mapping: the page's address_space
|
|
|
|
* @offset: page index
|
|
|
|
* @gfp_mask: page allocation mode
|
|
|
|
*
|
|
|
|
* This function is used to add a page to the pagecache. It must be locked.
|
|
|
|
* This function does not add the page to the LRU. The caller must do that.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
2014-04-04 05:47:51 +08:00
|
|
|
*/
|
|
|
|
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
|
|
|
|
pgoff_t offset, gfp_t gfp_mask)
|
|
|
|
{
|
|
|
|
return __add_to_page_cache_locked(page, mapping, offset,
|
|
|
|
gfp_mask, NULL);
|
|
|
|
}
|
2008-07-26 10:45:30 +08:00
|
|
|
EXPORT_SYMBOL(add_to_page_cache_locked);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
|
2005-10-21 15:18:50 +08:00
|
|
|
pgoff_t offset, gfp_t gfp_mask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2014-04-04 05:47:51 +08:00
|
|
|
void *shadow = NULL;
|
2008-10-19 11:26:32 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-01-16 08:51:24 +08:00
|
|
|
__SetPageLocked(page);
|
2014-04-04 05:47:51 +08:00
|
|
|
ret = __add_to_page_cache_locked(page, mapping, offset,
|
|
|
|
gfp_mask, &shadow);
|
|
|
|
if (unlikely(ret))
|
2016-01-16 08:51:24 +08:00
|
|
|
__ClearPageLocked(page);
|
2014-04-04 05:47:51 +08:00
|
|
|
else {
|
|
|
|
/*
|
|
|
|
* The page might have been evicted from cache only
|
|
|
|
* recently, in which case it should be activated like
|
|
|
|
* any other repeatedly accessed page.
|
2016-05-21 07:56:25 +08:00
|
|
|
* The exception is pages getting rewritten; evicting other
|
|
|
|
* data from the working set, only to cache data that will
|
|
|
|
* get overwritten with something else, is a waste of memory.
|
2014-04-04 05:47:51 +08:00
|
|
|
*/
|
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
|
|
|
WARN_ON_ONCE(PageActive(page));
|
|
|
|
if (!(gfp_mask & __GFP_WRITE) && shadow)
|
|
|
|
workingset_refault(page, shadow);
|
2014-04-04 05:47:51 +08:00
|
|
|
lru_cache_add(page);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-02-09 22:02:42 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-03-24 19:16:04 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
2006-10-29 01:38:23 +08:00
|
|
|
struct page *__page_cache_alloc(gfp_t gfp)
|
2006-03-24 19:16:04 +08:00
|
|
|
{
|
2010-05-25 05:32:08 +08:00
|
|
|
int n;
|
|
|
|
struct page *page;
|
|
|
|
|
2006-03-24 19:16:04 +08:00
|
|
|
if (cpuset_do_page_mem_spread()) {
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
|
|
|
unsigned int cpuset_mems_cookie;
|
|
|
|
do {
|
2014-04-04 05:47:24 +08:00
|
|
|
cpuset_mems_cookie = read_mems_allowed_begin();
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
|
|
|
n = cpuset_mem_spread_node();
|
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 06:03:50 +08:00
|
|
|
page = __alloc_pages_node(n, gfp, 0);
|
2014-04-04 05:47:24 +08:00
|
|
|
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:11 +08:00
|
|
|
|
2010-05-25 05:32:08 +08:00
|
|
|
return page;
|
2006-03-24 19:16:04 +08:00
|
|
|
}
|
2006-10-29 01:38:23 +08:00
|
|
|
return alloc_pages(gfp, 0);
|
2006-03-24 19:16:04 +08:00
|
|
|
}
|
2006-10-29 01:38:23 +08:00
|
|
|
EXPORT_SYMBOL(__page_cache_alloc);
|
2006-03-24 19:16:04 +08:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* In order to wait for pages to become available there must be
|
|
|
|
* waitqueues associated with pages. By using a hash table of
|
|
|
|
* waitqueues where the bucket discipline is to maintain all
|
|
|
|
* waiters on the same queue and wake all when any of the pages
|
|
|
|
* become available, and for the woken contexts to check to be
|
|
|
|
* sure the appropriate page became available, this saves space
|
|
|
|
* at a cost of "thundering herd" phenomena during rare hash
|
|
|
|
* collisions.
|
|
|
|
*/
|
2016-12-25 11:00:30 +08:00
|
|
|
#define PAGE_WAIT_TABLE_BITS 8
|
|
|
|
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
|
|
|
|
static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
|
|
|
|
|
|
|
|
static wait_queue_head_t *page_waitqueue(struct page *page)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
void __init pagecache_init(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
int i;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
|
|
|
|
init_waitqueue_head(&page_wait_table[i]);
|
|
|
|
|
|
|
|
page_writeback_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 04:55:12 +08:00
|
|
|
/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
|
2016-12-25 11:00:30 +08:00
|
|
|
struct wait_page_key {
|
|
|
|
struct page *page;
|
|
|
|
int bit_nr;
|
|
|
|
int page_match;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct wait_page_queue {
|
|
|
|
struct page *page;
|
|
|
|
int bit_nr;
|
2017-06-20 18:06:13 +08:00
|
|
|
wait_queue_entry_t wait;
|
2016-12-25 11:00:30 +08:00
|
|
|
};
|
|
|
|
|
2017-06-20 18:06:13 +08:00
|
|
|
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
|
2011-05-25 08:11:29 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
struct wait_page_key *key = arg;
|
|
|
|
struct wait_page_queue *wait_page
|
|
|
|
= container_of(wait, struct wait_page_queue, wait);
|
|
|
|
|
|
|
|
if (wait_page->page != key->page)
|
|
|
|
return 0;
|
|
|
|
key->page_match = 1;
|
2011-05-25 08:11:29 +08:00
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
if (wait_page->bit_nr != key->bit_nr)
|
|
|
|
return 0;
|
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 04:55:12 +08:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
/*
|
|
|
|
* Stop walking if it's locked.
|
|
|
|
* Is this safe if put_and_wait_on_page_locked() is in use?
|
|
|
|
* Yes: the waker must hold a reference to this page, and if PG_locked
|
|
|
|
* has now already been set by another task, that task must also hold
|
|
|
|
* a reference to the *same usage* of this page; so there is no need
|
|
|
|
* to walk on to wake even the put_and_wait_on_page_locked() callers.
|
|
|
|
*/
|
2016-12-25 11:00:30 +08:00
|
|
|
if (test_bit(key->bit_nr, &key->page->flags))
|
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 04:55:12 +08:00
|
|
|
return -1;
|
2011-05-25 08:11:29 +08:00
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
return autoremove_wake_function(wait, mode, sync, key);
|
2011-05-25 08:11:29 +08:00
|
|
|
}
|
|
|
|
|
2017-02-23 07:44:41 +08:00
|
|
|
static void wake_up_page_bit(struct page *page, int bit_nr)
|
2014-09-25 11:55:19 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
|
|
|
struct wait_page_key key;
|
|
|
|
unsigned long flags;
|
2017-08-26 00:13:55 +08:00
|
|
|
wait_queue_entry_t bookmark;
|
2014-09-25 11:55:19 +08:00
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
key.page = page;
|
|
|
|
key.bit_nr = bit_nr;
|
|
|
|
key.page_match = 0;
|
|
|
|
|
2017-08-26 00:13:55 +08:00
|
|
|
bookmark.flags = 0;
|
|
|
|
bookmark.private = NULL;
|
|
|
|
bookmark.func = NULL;
|
|
|
|
INIT_LIST_HEAD(&bookmark.entry);
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
spin_lock_irqsave(&q->lock, flags);
|
2017-08-26 00:13:55 +08:00
|
|
|
__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
|
|
|
|
|
|
|
|
while (bookmark.flags & WQ_FLAG_BOOKMARK) {
|
|
|
|
/*
|
|
|
|
* Take a breather from holding the lock,
|
|
|
|
* allow pages that finish wake up asynchronously
|
|
|
|
* to acquire the lock and remove themselves
|
|
|
|
* from wait queue
|
|
|
|
*/
|
|
|
|
spin_unlock_irqrestore(&q->lock, flags);
|
|
|
|
cpu_relax();
|
|
|
|
spin_lock_irqsave(&q->lock, flags);
|
|
|
|
__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
|
|
|
|
}
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
/*
|
|
|
|
* It is possible for other pages to have collided on the waitqueue
|
|
|
|
* hash, so in that case check for a page match. That prevents a long-
|
|
|
|
* term waiter
|
|
|
|
*
|
|
|
|
* It is still possible to miss a case here, when we woke page waiters
|
|
|
|
* and removed them from the waitqueue, but there are still other
|
|
|
|
* page waiters.
|
|
|
|
*/
|
|
|
|
if (!waitqueue_active(q) || !key.page_match) {
|
|
|
|
ClearPageWaiters(page);
|
|
|
|
/*
|
|
|
|
* It's possible to miss clearing Waiters here, when we woke
|
|
|
|
* our page waiters, but the hashed waitqueue has waiters for
|
|
|
|
* other pages on it.
|
|
|
|
*
|
|
|
|
* That's okay, it's a rare case. The next waker will clear it.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&q->lock, flags);
|
|
|
|
}
|
2017-02-23 07:44:41 +08:00
|
|
|
|
|
|
|
static void wake_up_page(struct page *page, int bit)
|
|
|
|
{
|
|
|
|
if (!PageWaiters(page))
|
|
|
|
return;
|
|
|
|
wake_up_page_bit(page, bit);
|
|
|
|
}
|
2016-12-25 11:00:30 +08:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
/*
|
|
|
|
* A choice of three behaviors for wait_on_page_bit_common():
|
|
|
|
*/
|
|
|
|
enum behavior {
|
|
|
|
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
|
|
|
|
* __lock_page() waiting on then setting PG_locked.
|
|
|
|
*/
|
|
|
|
SHARED, /* Hold ref to page and check the bit when woken, like
|
|
|
|
* wait_on_page_writeback() waiting on PG_writeback.
|
|
|
|
*/
|
|
|
|
DROP, /* Drop ref to page before wait, no check when woken,
|
|
|
|
* like put_and_wait_on_page_locked() on PG_locked.
|
|
|
|
*/
|
|
|
|
};
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
static inline int wait_on_page_bit_common(wait_queue_head_t *q,
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
struct page *page, int bit_nr, int state, enum behavior behavior)
|
2016-12-25 11:00:30 +08:00
|
|
|
{
|
|
|
|
struct wait_page_queue wait_page;
|
2017-06-20 18:06:13 +08:00
|
|
|
wait_queue_entry_t *wait = &wait_page.wait;
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
bool bit_is_set;
|
2018-10-27 06:06:08 +08:00
|
|
|
bool thrashing = false;
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
bool delayacct = false;
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
unsigned long pflags;
|
2016-12-25 11:00:30 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
if (bit_nr == PG_locked &&
|
2018-10-27 06:06:08 +08:00
|
|
|
!PageUptodate(page) && PageWorkingset(page)) {
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
if (!PageSwapBacked(page)) {
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
delayacct_thrashing_start();
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
delayacct = true;
|
|
|
|
}
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
psi_memstall_enter(&pflags);
|
2018-10-27 06:06:08 +08:00
|
|
|
thrashing = true;
|
|
|
|
}
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
init_wait(wait);
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
|
2016-12-25 11:00:30 +08:00
|
|
|
wait->func = wake_page_function;
|
|
|
|
wait_page.page = page;
|
|
|
|
wait_page.bit_nr = bit_nr;
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
spin_lock_irq(&q->lock);
|
|
|
|
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
if (likely(list_empty(&wait->entry))) {
|
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 04:55:12 +08:00
|
|
|
__add_wait_queue_entry_tail(q, wait);
|
2016-12-25 11:00:30 +08:00
|
|
|
SetPageWaiters(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
set_current_state(state);
|
|
|
|
|
|
|
|
spin_unlock_irq(&q->lock);
|
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
bit_is_set = test_bit(bit_nr, &page->flags);
|
|
|
|
if (behavior == DROP)
|
|
|
|
put_page(page);
|
|
|
|
|
|
|
|
if (likely(bit_is_set))
|
2016-12-25 11:00:30 +08:00
|
|
|
io_schedule();
|
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
if (behavior == EXCLUSIVE) {
|
2016-12-25 11:00:30 +08:00
|
|
|
if (!test_and_set_bit_lock(bit_nr, &page->flags))
|
|
|
|
break;
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
} else if (behavior == SHARED) {
|
2016-12-25 11:00:30 +08:00
|
|
|
if (!test_bit(bit_nr, &page->flags))
|
|
|
|
break;
|
|
|
|
}
|
2017-08-28 07:25:09 +08:00
|
|
|
|
2019-01-04 07:28:55 +08:00
|
|
|
if (signal_pending_state(state, current)) {
|
2017-08-28 07:25:09 +08:00
|
|
|
ret = -EINTR;
|
|
|
|
break;
|
|
|
|
}
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
|
|
|
|
if (behavior == DROP) {
|
|
|
|
/*
|
|
|
|
* We can no longer safely access page->flags:
|
|
|
|
* even if CONFIG_MEMORY_HOTREMOVE is not enabled,
|
|
|
|
* there is a risk of waiting forever on a page reused
|
|
|
|
* for something that keeps it locked indefinitely.
|
|
|
|
* But best check for -EINTR above before breaking.
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
}
|
2016-12-25 11:00:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
finish_wait(q, wait);
|
|
|
|
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
if (thrashing) {
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
if (delayacct)
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
delayacct_thrashing_end();
|
|
|
|
psi_memstall_leave(&pflags);
|
|
|
|
}
|
2018-10-27 06:06:08 +08:00
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
/*
|
|
|
|
* A signal could leave PageWaiters set. Clearing it here if
|
|
|
|
* !waitqueue_active would be possible (by open-coding finish_wait),
|
|
|
|
* but still fail to catch it in the case of wait hash collision. We
|
|
|
|
* already can fail to clear wait hash collision cases, so don't
|
|
|
|
* bother with signals either.
|
|
|
|
*/
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
void wait_on_page_bit(struct page *page, int bit_nr)
|
|
|
|
{
|
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
|
2016-12-25 11:00:30 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(wait_on_page_bit);
|
|
|
|
|
|
|
|
int wait_on_page_bit_killable(struct page *page, int bit_nr)
|
|
|
|
{
|
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
|
2014-09-25 11:55:19 +08:00
|
|
|
}
|
2017-11-02 23:27:52 +08:00
|
|
|
EXPORT_SYMBOL(wait_on_page_bit_killable);
|
2014-09-25 11:55:19 +08:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
/**
|
|
|
|
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
|
|
|
|
* @page: The page to wait for.
|
|
|
|
*
|
|
|
|
* The caller should hold a reference on @page. They expect the page to
|
|
|
|
* become unlocked relatively soon, but do not wish to hold up migration
|
|
|
|
* (for example) by holding the reference while waiting for the page to
|
|
|
|
* come unlocked. After this function returns, the caller should not
|
|
|
|
* dereference @page.
|
|
|
|
*/
|
|
|
|
void put_and_wait_on_page_locked(struct page *page)
|
|
|
|
{
|
|
|
|
wait_queue_head_t *q;
|
|
|
|
|
|
|
|
page = compound_head(page);
|
|
|
|
q = page_waitqueue(page);
|
|
|
|
wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
|
|
|
|
}
|
|
|
|
|
2009-04-03 23:42:39 +08:00
|
|
|
/**
|
|
|
|
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
|
2009-04-14 05:39:54 +08:00
|
|
|
* @page: Page defining the wait queue of interest
|
|
|
|
* @waiter: Waiter to add to the queue
|
2009-04-03 23:42:39 +08:00
|
|
|
*
|
|
|
|
* Add an arbitrary @waiter to the wait queue for the nominated @page.
|
|
|
|
*/
|
2017-06-20 18:06:13 +08:00
|
|
|
void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
|
2009-04-03 23:42:39 +08:00
|
|
|
{
|
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->lock, flags);
|
2017-08-29 07:45:40 +08:00
|
|
|
__add_wait_queue_entry_tail(q, waiter);
|
2016-12-25 11:00:30 +08:00
|
|
|
SetPageWaiters(page);
|
2009-04-03 23:42:39 +08:00
|
|
|
spin_unlock_irqrestore(&q->lock, flags);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(add_page_wait_queue);
|
|
|
|
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
#ifndef clear_bit_unlock_is_negative_byte
|
|
|
|
|
|
|
|
/*
|
|
|
|
* PG_waiters is the high bit in the same byte as PG_lock.
|
|
|
|
*
|
|
|
|
* On x86 (and on many other architectures), we can clear PG_lock and
|
|
|
|
* test the sign bit at the same time. But if the architecture does
|
|
|
|
* not support that special operation, we just do this all by hand
|
|
|
|
* instead.
|
|
|
|
*
|
|
|
|
* The read of PG_waiters has to be after (or concurrently with) PG_locked
|
2020-06-05 07:49:22 +08:00
|
|
|
* being cleared, but a memory barrier should be unnecessary since it is
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
* in the same byte as PG_locked.
|
|
|
|
*/
|
|
|
|
static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem)
|
|
|
|
{
|
|
|
|
clear_bit_unlock(nr, mem);
|
|
|
|
/* smp_mb__after_atomic(); */
|
2016-12-30 06:16:07 +08:00
|
|
|
return test_bit(PG_waiters, mem);
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2006-06-23 17:03:49 +08:00
|
|
|
* unlock_page - unlock a locked page
|
2005-04-17 06:20:36 +08:00
|
|
|
* @page: the page
|
|
|
|
*
|
|
|
|
* Unlocks the page and wakes up sleepers in ___wait_on_page_locked().
|
|
|
|
* Also wakes sleepers in wait_on_page_writeback() because the wakeup
|
2014-09-09 00:27:23 +08:00
|
|
|
* mechanism between PageLocked pages and PageWriteback pages is shared.
|
2005-04-17 06:20:36 +08:00
|
|
|
* But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
|
|
|
|
*
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
* Note that this depends on PG_waiters being the sign bit in the byte
|
|
|
|
* that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to
|
|
|
|
* clear the PG_locked bit and test PG_waiters at the same time fairly
|
|
|
|
* portably (architectures that do LL/SC can test any bit, while x86 can
|
|
|
|
* test the sign bit).
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2008-02-05 14:29:26 +08:00
|
|
|
void unlock_page(struct page *page)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
BUILD_BUG_ON(PG_waiters != 7);
|
2016-01-16 08:51:24 +08:00
|
|
|
page = compound_head(page);
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
|
|
|
|
wake_up_page_bit(page, PG_locked);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(unlock_page);
|
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
|
|
|
* end_page_writeback - end writeback against a page
|
|
|
|
* @page: the page
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
void end_page_writeback(struct page *page)
|
|
|
|
{
|
2014-06-05 07:10:34 +08:00
|
|
|
/*
|
|
|
|
* TestClearPageReclaim could be used here but it is an atomic
|
|
|
|
* operation and overkill in this particular case. Failing to
|
|
|
|
* shuffle a page marked for immediate reclaim is too mild to
|
|
|
|
* justify taking an atomic operation penalty at the end of
|
|
|
|
* ever page writeback.
|
|
|
|
*/
|
|
|
|
if (PageReclaim(page)) {
|
|
|
|
ClearPageReclaim(page);
|
2008-04-28 17:12:38 +08:00
|
|
|
rotate_reclaimable_page(page);
|
2014-06-05 07:10:34 +08:00
|
|
|
}
|
2008-04-28 17:12:38 +08:00
|
|
|
|
|
|
|
if (!test_clear_page_writeback(page))
|
|
|
|
BUG();
|
|
|
|
|
2014-03-18 01:06:10 +08:00
|
|
|
smp_mb__after_atomic();
|
2005-04-17 06:20:36 +08:00
|
|
|
wake_up_page(page, PG_writeback);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(end_page_writeback);
|
|
|
|
|
2014-06-05 07:07:45 +08:00
|
|
|
/*
|
|
|
|
* After completing I/O on a page, call this routine to update the page
|
|
|
|
* flags appropriately
|
|
|
|
*/
|
2016-08-05 22:11:04 +08:00
|
|
|
void page_endio(struct page *page, bool is_write, int err)
|
2014-06-05 07:07:45 +08:00
|
|
|
{
|
2016-08-05 22:11:04 +08:00
|
|
|
if (!is_write) {
|
2014-06-05 07:07:45 +08:00
|
|
|
if (!err) {
|
|
|
|
SetPageUptodate(page);
|
|
|
|
} else {
|
|
|
|
ClearPageUptodate(page);
|
|
|
|
SetPageError(page);
|
|
|
|
}
|
|
|
|
unlock_page(page);
|
2016-08-05 04:23:34 +08:00
|
|
|
} else {
|
2014-06-05 07:07:45 +08:00
|
|
|
if (err) {
|
2017-02-25 06:59:59 +08:00
|
|
|
struct address_space *mapping;
|
|
|
|
|
2014-06-05 07:07:45 +08:00
|
|
|
SetPageError(page);
|
2017-02-25 06:59:59 +08:00
|
|
|
mapping = page_mapping(page);
|
|
|
|
if (mapping)
|
|
|
|
mapping_set_error(mapping, err);
|
2014-06-05 07:07:45 +08:00
|
|
|
}
|
|
|
|
end_page_writeback(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(page_endio);
|
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
|
|
|
* __lock_page - get a lock on the page, assuming we need to sleep to get it
|
2017-02-23 07:44:44 +08:00
|
|
|
* @__page: the page to lock
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2016-12-25 11:00:30 +08:00
|
|
|
void __lock_page(struct page *__page)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
struct page *page = compound_head(__page);
|
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
|
|
|
|
EXCLUSIVE);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__lock_page);
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
int __lock_page_killable(struct page *__page)
|
2007-12-07 00:18:49 +08:00
|
|
|
{
|
2016-12-25 11:00:30 +08:00
|
|
|
struct page *page = compound_head(__page);
|
|
|
|
wait_queue_head_t *q = page_waitqueue(page);
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:36:14 +08:00
|
|
|
return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
|
|
|
|
EXCLUSIVE);
|
2007-12-07 00:18:49 +08:00
|
|
|
}
|
2009-02-09 22:02:42 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__lock_page_killable);
|
2007-12-07 00:18:49 +08:00
|
|
|
|
2014-08-07 07:07:24 +08:00
|
|
|
/*
|
|
|
|
* Return values:
|
|
|
|
* 1 - page is locked; mmap_sem is still held.
|
|
|
|
* 0 - page is not locked.
|
|
|
|
* mmap_sem has been released (up_read()), unless flags had both
|
|
|
|
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
|
|
|
|
* which case mmap_sem is still held.
|
|
|
|
*
|
|
|
|
* If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
|
|
|
|
* with the page locked and the mmap_sem unperturbed.
|
|
|
|
*/
|
2010-10-27 05:21:57 +08:00
|
|
|
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
|
|
|
|
unsigned int flags)
|
|
|
|
{
|
2020-04-02 12:08:45 +08:00
|
|
|
if (fault_flag_allow_retry_first(flags)) {
|
2011-05-25 08:11:30 +08:00
|
|
|
/*
|
|
|
|
* CAUTION! In this case, mmap_sem is not released
|
|
|
|
* even though return 0.
|
|
|
|
*/
|
|
|
|
if (flags & FAULT_FLAG_RETRY_NOWAIT)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
if (flags & FAULT_FLAG_KILLABLE)
|
|
|
|
wait_on_page_locked_killable(page);
|
|
|
|
else
|
2011-03-23 07:30:51 +08:00
|
|
|
wait_on_page_locked(page);
|
2010-10-27 05:21:57 +08:00
|
|
|
return 0;
|
2011-05-25 08:11:30 +08:00
|
|
|
} else {
|
|
|
|
if (flags & FAULT_FLAG_KILLABLE) {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = __lock_page_killable(page);
|
|
|
|
if (ret) {
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
__lock_page(page);
|
|
|
|
return 1;
|
2010-10-27 05:21:57 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-04-04 05:47:44 +08:00
|
|
|
/**
|
2017-11-22 03:07:06 +08:00
|
|
|
* page_cache_next_miss() - Find the next gap in the page cache.
|
|
|
|
* @mapping: Mapping.
|
|
|
|
* @index: Index.
|
|
|
|
* @max_scan: Maximum range to search.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
|
|
|
|
* gap with the lowest index.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* This function may be called under the rcu_read_lock. However, this will
|
|
|
|
* not atomically search a snapshot of the cache at a single point in time.
|
|
|
|
* For example, if a gap is created at index 5, then subsequently a gap is
|
|
|
|
* created at index 10, page_cache_next_miss covering both indices may
|
|
|
|
* return 10 if called under the rcu_read_lock.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* Return: The index of the gap if found, otherwise an index outside the
|
|
|
|
* range specified (in which case 'return - index >= max_scan' will be true).
|
|
|
|
* In the rare case of index wrap-around, 0 will be returned.
|
2014-04-04 05:47:44 +08:00
|
|
|
*/
|
2017-11-22 03:07:06 +08:00
|
|
|
pgoff_t page_cache_next_miss(struct address_space *mapping,
|
2014-04-04 05:47:44 +08:00
|
|
|
pgoff_t index, unsigned long max_scan)
|
|
|
|
{
|
2017-11-22 03:07:06 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
2014-04-04 05:47:44 +08:00
|
|
|
|
2017-11-22 03:07:06 +08:00
|
|
|
while (max_scan--) {
|
|
|
|
void *entry = xas_next(&xas);
|
|
|
|
if (!entry || xa_is_value(entry))
|
2014-04-04 05:47:44 +08:00
|
|
|
break;
|
2017-11-22 03:07:06 +08:00
|
|
|
if (xas.xa_index == 0)
|
2014-04-04 05:47:44 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2017-11-22 03:07:06 +08:00
|
|
|
return xas.xa_index;
|
2014-04-04 05:47:44 +08:00
|
|
|
}
|
2017-11-22 03:07:06 +08:00
|
|
|
EXPORT_SYMBOL(page_cache_next_miss);
|
2014-04-04 05:47:44 +08:00
|
|
|
|
|
|
|
/**
|
2019-05-14 08:21:29 +08:00
|
|
|
* page_cache_prev_miss() - Find the previous gap in the page cache.
|
2017-11-22 03:07:06 +08:00
|
|
|
* @mapping: Mapping.
|
|
|
|
* @index: Index.
|
|
|
|
* @max_scan: Maximum range to search.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* Search the range [max(index - max_scan + 1, 0), index] for the
|
|
|
|
* gap with the highest index.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* This function may be called under the rcu_read_lock. However, this will
|
|
|
|
* not atomically search a snapshot of the cache at a single point in time.
|
|
|
|
* For example, if a gap is created at index 10, then subsequently a gap is
|
|
|
|
* created at index 5, page_cache_prev_miss() covering both indices may
|
|
|
|
* return 5 if called under the rcu_read_lock.
|
2014-04-04 05:47:44 +08:00
|
|
|
*
|
2017-11-22 03:07:06 +08:00
|
|
|
* Return: The index of the gap if found, otherwise an index outside the
|
|
|
|
* range specified (in which case 'index - return >= max_scan' will be true).
|
|
|
|
* In the rare case of wrap-around, ULONG_MAX will be returned.
|
2014-04-04 05:47:44 +08:00
|
|
|
*/
|
2017-11-22 03:07:06 +08:00
|
|
|
pgoff_t page_cache_prev_miss(struct address_space *mapping,
|
2014-04-04 05:47:44 +08:00
|
|
|
pgoff_t index, unsigned long max_scan)
|
|
|
|
{
|
2017-11-22 03:07:06 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
2014-04-04 05:47:44 +08:00
|
|
|
|
2017-11-22 03:07:06 +08:00
|
|
|
while (max_scan--) {
|
|
|
|
void *entry = xas_prev(&xas);
|
|
|
|
if (!entry || xa_is_value(entry))
|
2014-04-04 05:47:44 +08:00
|
|
|
break;
|
2017-11-22 03:07:06 +08:00
|
|
|
if (xas.xa_index == ULONG_MAX)
|
2014-04-04 05:47:44 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2017-11-22 03:07:06 +08:00
|
|
|
return xas.xa_index;
|
2014-04-04 05:47:44 +08:00
|
|
|
}
|
2017-11-22 03:07:06 +08:00
|
|
|
EXPORT_SYMBOL(page_cache_prev_miss);
|
2014-04-04 05:47:44 +08:00
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
2014-04-04 05:47:46 +08:00
|
|
|
* find_get_entry - find and get a page cache entry
|
2006-06-23 17:03:49 +08:00
|
|
|
* @mapping: the address_space to search
|
2014-04-04 05:47:46 +08:00
|
|
|
* @offset: the page cache index
|
|
|
|
*
|
|
|
|
* Looks up the page cache slot at @mapping & @offset. If there is a
|
|
|
|
* page cache page, it is returned with an increased refcount.
|
2006-06-23 17:03:49 +08:00
|
|
|
*
|
2014-05-07 03:50:05 +08:00
|
|
|
* If the slot holds a shadow entry of a previously evicted page, or a
|
|
|
|
* swap entry from shmem/tmpfs, it is returned.
|
2014-04-04 05:47:46 +08:00
|
|
|
*
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return: the found page or shadow entry, %NULL if nothing is found.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2014-04-04 05:47:46 +08:00
|
|
|
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2018-05-17 04:12:50 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, offset);
|
2019-09-24 06:34:52 +08:00
|
|
|
struct page *page;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-07-26 10:45:31 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
repeat:
|
2018-05-17 04:12:50 +08:00
|
|
|
xas_reset(&xas);
|
|
|
|
page = xas_load(&xas);
|
|
|
|
if (xas_retry(&xas, page))
|
|
|
|
goto repeat;
|
|
|
|
/*
|
|
|
|
* A shadow entry of a recently evicted page, or a swap entry from
|
|
|
|
* shmem/tmpfs. Return it without attempting to raise page count.
|
|
|
|
*/
|
|
|
|
if (!page || xa_is_value(page))
|
|
|
|
goto out;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 04:12:50 +08:00
|
|
|
goto repeat;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2018-05-17 04:12:50 +08:00
|
|
|
/*
|
2019-09-24 06:34:52 +08:00
|
|
|
* Has the page moved or been split?
|
2018-05-17 04:12:50 +08:00
|
|
|
* This is part of the lockless pagecache protocol. See
|
|
|
|
* include/linux/pagemap.h for details.
|
|
|
|
*/
|
|
|
|
if (unlikely(page != xas_reload(&xas))) {
|
2019-09-24 06:34:52 +08:00
|
|
|
put_page(page);
|
2018-05-17 04:12:50 +08:00
|
|
|
goto repeat;
|
2008-07-26 10:45:31 +08:00
|
|
|
}
|
2019-09-24 06:34:52 +08:00
|
|
|
page = find_subpage(page, offset);
|
2010-11-12 06:05:19 +08:00
|
|
|
out:
|
2008-07-26 10:45:31 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return page;
|
|
|
|
}
|
|
|
|
|
2014-04-04 05:47:46 +08:00
|
|
|
/**
|
|
|
|
* find_lock_entry - locate, pin and lock a page cache entry
|
|
|
|
* @mapping: the address_space to search
|
|
|
|
* @offset: the page cache index
|
|
|
|
*
|
|
|
|
* Looks up the page cache slot at @mapping & @offset. If there is a
|
|
|
|
* page cache page, it is returned locked and with an increased
|
|
|
|
* refcount.
|
|
|
|
*
|
2014-05-07 03:50:05 +08:00
|
|
|
* If the slot holds a shadow entry of a previously evicted page, or a
|
|
|
|
* swap entry from shmem/tmpfs, it is returned.
|
2014-04-04 05:47:46 +08:00
|
|
|
*
|
|
|
|
* find_lock_entry() may sleep.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: the found page or shadow entry, %NULL if nothing is found.
|
2014-04-04 05:47:46 +08:00
|
|
|
*/
|
|
|
|
struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
repeat:
|
2014-04-04 05:47:46 +08:00
|
|
|
page = find_get_entry(mapping, offset);
|
2018-05-17 04:12:50 +08:00
|
|
|
if (page && !xa_is_value(page)) {
|
2008-07-26 10:45:31 +08:00
|
|
|
lock_page(page);
|
|
|
|
/* Has the page been truncated? */
|
2016-07-27 06:26:04 +08:00
|
|
|
if (unlikely(page_mapping(page) != mapping)) {
|
2008-07-26 10:45:31 +08:00
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2008-07-26 10:45:31 +08:00
|
|
|
goto repeat;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2016-07-27 06:26:04 +08:00
|
|
|
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
return page;
|
|
|
|
}
|
2014-04-04 05:47:46 +08:00
|
|
|
EXPORT_SYMBOL(find_lock_entry);
|
|
|
|
|
|
|
|
/**
|
2020-04-02 12:05:07 +08:00
|
|
|
* pagecache_get_page - Find and get a reference to a page.
|
|
|
|
* @mapping: The address_space to search.
|
|
|
|
* @index: The page index.
|
|
|
|
* @fgp_flags: %FGP flags modify how the page is returned.
|
|
|
|
* @gfp_mask: Memory allocation flags to use if %FGP_CREAT is specified.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2020-04-02 12:05:07 +08:00
|
|
|
* Looks up the page cache entry at @mapping & @index.
|
2014-04-04 05:47:46 +08:00
|
|
|
*
|
2020-04-02 12:05:07 +08:00
|
|
|
* @fgp_flags can be zero or more of these flags:
|
2017-03-31 04:11:36 +08:00
|
|
|
*
|
2020-04-02 12:05:07 +08:00
|
|
|
* * %FGP_ACCESSED - The page will be marked accessed.
|
|
|
|
* * %FGP_LOCK - The page is returned locked.
|
|
|
|
* * %FGP_CREAT - If no page is present then a new page is allocated using
|
|
|
|
* @gfp_mask and added to the page cache and the VM's LRU list.
|
|
|
|
* The page is returned locked and with an increased refcount.
|
|
|
|
* * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
|
|
|
|
* page is already in cache. If the page was allocated, unlock it before
|
|
|
|
* returning so the caller can do the same dance.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2020-04-02 12:05:07 +08:00
|
|
|
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
|
|
|
|
* if the %GFP flags specified for %FGP_CREAT are atomic.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2014-06-05 07:10:31 +08:00
|
|
|
* If there is a page cache page, it is returned with an increased refcount.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
2020-04-02 12:05:07 +08:00
|
|
|
* Return: The found page or %NULL otherwise.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2020-04-02 12:05:07 +08:00
|
|
|
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
|
|
|
|
int fgp_flags, gfp_t gfp_mask)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-10-16 16:24:57 +08:00
|
|
|
struct page *page;
|
2014-06-05 07:10:31 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
repeat:
|
2020-04-02 12:05:07 +08:00
|
|
|
page = find_get_entry(mapping, index);
|
2017-11-04 01:30:42 +08:00
|
|
|
if (xa_is_value(page))
|
2014-06-05 07:10:31 +08:00
|
|
|
page = NULL;
|
|
|
|
if (!page)
|
|
|
|
goto no_page;
|
|
|
|
|
|
|
|
if (fgp_flags & FGP_LOCK) {
|
|
|
|
if (fgp_flags & FGP_NOWAIT) {
|
|
|
|
if (!trylock_page(page)) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-06-05 07:10:31 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
lock_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Has the page been truncated? */
|
2019-09-24 06:37:47 +08:00
|
|
|
if (unlikely(compound_head(page)->mapping != mapping)) {
|
2014-06-05 07:10:31 +08:00
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-06-05 07:10:31 +08:00
|
|
|
goto repeat;
|
|
|
|
}
|
2020-04-02 12:05:07 +08:00
|
|
|
VM_BUG_ON_PAGE(page->index != index, page);
|
2014-06-05 07:10:31 +08:00
|
|
|
}
|
|
|
|
|
2018-12-28 16:37:35 +08:00
|
|
|
if (fgp_flags & FGP_ACCESSED)
|
2014-06-05 07:10:31 +08:00
|
|
|
mark_page_accessed(page);
|
|
|
|
|
|
|
|
no_page:
|
|
|
|
if (!page && (fgp_flags & FGP_CREAT)) {
|
|
|
|
int err;
|
|
|
|
if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
|
2014-12-30 03:30:35 +08:00
|
|
|
gfp_mask |= __GFP_WRITE;
|
|
|
|
if (fgp_flags & FGP_NOFS)
|
|
|
|
gfp_mask &= ~__GFP_FS;
|
2014-06-05 07:10:31 +08:00
|
|
|
|
2014-12-30 03:30:35 +08:00
|
|
|
page = __page_cache_alloc(gfp_mask);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (!page)
|
|
|
|
return NULL;
|
2014-06-05 07:10:31 +08:00
|
|
|
|
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:14 +08:00
|
|
|
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
|
2014-06-05 07:10:31 +08:00
|
|
|
fgp_flags |= FGP_LOCK;
|
|
|
|
|
2014-08-07 07:06:43 +08:00
|
|
|
/* Init accessed so avoid atomic mark_page_accessed later */
|
2014-06-05 07:10:31 +08:00
|
|
|
if (fgp_flags & FGP_ACCESSED)
|
2014-08-07 07:06:43 +08:00
|
|
|
__SetPageReferenced(page);
|
2014-06-05 07:10:31 +08:00
|
|
|
|
2020-04-02 12:05:07 +08:00
|
|
|
err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (unlikely(err)) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2007-10-16 16:24:57 +08:00
|
|
|
page = NULL;
|
|
|
|
if (err == -EEXIST)
|
|
|
|
goto repeat;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* add_to_page_cache_lru locks the page, and for mmap we expect
|
|
|
|
* an unlocked page.
|
|
|
|
*/
|
|
|
|
if (page && (fgp_flags & FGP_FOR_MMAP))
|
|
|
|
unlock_page(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2014-06-05 07:10:31 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return page;
|
|
|
|
}
|
2014-06-05 07:10:31 +08:00
|
|
|
EXPORT_SYMBOL(pagecache_get_page);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2014-04-04 05:47:46 +08:00
|
|
|
/**
|
|
|
|
* find_get_entries - gang pagecache lookup
|
|
|
|
* @mapping: The address_space to search
|
|
|
|
* @start: The starting page cache index
|
|
|
|
* @nr_entries: The maximum number of entries
|
|
|
|
* @entries: Where the resulting entries are placed
|
|
|
|
* @indices: The cache indices corresponding to the entries in @entries
|
|
|
|
*
|
|
|
|
* find_get_entries() will search for and return a group of up to
|
|
|
|
* @nr_entries entries in the mapping. The entries are placed at
|
|
|
|
* @entries. find_get_entries() takes a reference against any actual
|
|
|
|
* pages it returns.
|
|
|
|
*
|
|
|
|
* The search returns a group of mapping-contiguous page cache entries
|
|
|
|
* with ascending indexes. There may be holes in the indices due to
|
|
|
|
* not-present pages.
|
|
|
|
*
|
2014-05-07 03:50:05 +08:00
|
|
|
* Any shadow entries of evicted pages, or swap entries from
|
|
|
|
* shmem/tmpfs, are included in the returned array.
|
2014-04-04 05:47:46 +08:00
|
|
|
*
|
mm: huge tmpfs: try to split_huge_page() when punching hole
Yang Shi writes:
Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed, unless the range covers the whole THP.
Even though all the subpages are truncated (randomly or sequentially), the
THP may still be kept in page cache.
This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled in base page size. So when using shmem THP
as memory backend, QEMU inflation actually doesn't work as expected since
it doesn't free memory. But the inflation usecase really needs to get the
memory freed. (Anonymous THP will also not get freed right away, but will
be freed eventually when all subpages are unmapped: whereas shmem THP
still stays in page cache.)
Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read of the punched area will return zeroes.
Hugh Dickins adds:
Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation. Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch a
hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.
That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky.
But in practice it works successfully enough that we've not seen any
problem from that.
Add shmem_punch_compound() to encapsulate the decision of when a split is
needed, and doing the split if so. Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing. Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty() when
clearing a hole - though I'm not certain that either is needed.
But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page. The easiest way to fix
that is for find_get_entries() to return early, as soon as it has put one
compound head or tail into the pagevec. At first this felt like a hack;
but on examination, this convention better suits all its callers - or will
do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
speedup by checking for compound pages there.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 11:07:57 +08:00
|
|
|
* If it finds a Transparent Huge Page, head or tail, find_get_entries()
|
|
|
|
* stops at that page: the caller is likely to have a better way to handle
|
|
|
|
* the compound page as a whole, and then skip its extent, than repeatedly
|
|
|
|
* calling find_get_entries() to return all its tails.
|
|
|
|
*
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return: the number of pages and shadow entries which were found.
|
2014-04-04 05:47:46 +08:00
|
|
|
*/
|
|
|
|
unsigned find_get_entries(struct address_space *mapping,
|
|
|
|
pgoff_t start, unsigned int nr_entries,
|
|
|
|
struct page **entries, pgoff_t *indices)
|
|
|
|
{
|
2018-05-17 05:20:45 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start);
|
|
|
|
struct page *page;
|
2014-04-04 05:47:46 +08:00
|
|
|
unsigned int ret = 0;
|
|
|
|
|
|
|
|
if (!nr_entries)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-05-17 05:20:45 +08:00
|
|
|
xas_for_each(&xas, page, ULONG_MAX) {
|
|
|
|
if (xas_retry(&xas, page))
|
2014-04-04 05:47:46 +08:00
|
|
|
continue;
|
2018-05-17 05:20:45 +08:00
|
|
|
/*
|
|
|
|
* A shadow entry of a recently evicted page, a swap
|
|
|
|
* entry from shmem/tmpfs or a DAX entry. Return it
|
|
|
|
* without attempting to raise page count.
|
|
|
|
*/
|
|
|
|
if (xa_is_value(page))
|
2014-04-04 05:47:46 +08:00
|
|
|
goto export;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 05:20:45 +08:00
|
|
|
goto retry;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Has the page moved or been split? */
|
2018-05-17 05:20:45 +08:00
|
|
|
if (unlikely(page != xas_reload(&xas)))
|
|
|
|
goto put_page;
|
|
|
|
|
mm: huge tmpfs: try to split_huge_page() when punching hole
Yang Shi writes:
Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed, unless the range covers the whole THP.
Even though all the subpages are truncated (randomly or sequentially), the
THP may still be kept in page cache.
This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled in base page size. So when using shmem THP
as memory backend, QEMU inflation actually doesn't work as expected since
it doesn't free memory. But the inflation usecase really needs to get the
memory freed. (Anonymous THP will also not get freed right away, but will
be freed eventually when all subpages are unmapped: whereas shmem THP
still stays in page cache.)
Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read of the punched area will return zeroes.
Hugh Dickins adds:
Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation. Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch a
hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.
That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky.
But in practice it works successfully enough that we've not seen any
problem from that.
Add shmem_punch_compound() to encapsulate the decision of when a split is
needed, and doing the split if so. Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing. Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty() when
clearing a hole - though I'm not certain that either is needed.
But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page. The easiest way to fix
that is for find_get_entries() to return early, as soon as it has put one
compound head or tail into the pagevec. At first this felt like a hack;
but on examination, this convention better suits all its callers - or will
do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
speedup by checking for compound pages there.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 11:07:57 +08:00
|
|
|
/*
|
|
|
|
* Terminate early on finding a THP, to allow the caller to
|
|
|
|
* handle it all at once; but continue if this is hugetlbfs.
|
|
|
|
*/
|
|
|
|
if (PageTransHuge(page) && !PageHuge(page)) {
|
|
|
|
page = find_subpage(page, xas.xa_index);
|
|
|
|
nr_entries = ret + 1;
|
|
|
|
}
|
2014-04-04 05:47:46 +08:00
|
|
|
export:
|
2018-05-17 05:20:45 +08:00
|
|
|
indices[ret] = xas.xa_index;
|
2014-04-04 05:47:46 +08:00
|
|
|
entries[ret] = page;
|
|
|
|
if (++ret == nr_entries)
|
|
|
|
break;
|
2018-05-17 05:20:45 +08:00
|
|
|
continue;
|
|
|
|
put_page:
|
2019-09-24 06:34:52 +08:00
|
|
|
put_page(page);
|
2018-05-17 05:20:45 +08:00
|
|
|
retry:
|
|
|
|
xas_reset(&xas);
|
2014-04-04 05:47:46 +08:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2017-09-07 07:21:21 +08:00
|
|
|
* find_get_pages_range - gang pagecache lookup
|
2005-04-17 06:20:36 +08:00
|
|
|
* @mapping: The address_space to search
|
|
|
|
* @start: The starting page index
|
2017-09-07 07:21:21 +08:00
|
|
|
* @end: The final page index (inclusive)
|
2005-04-17 06:20:36 +08:00
|
|
|
* @nr_pages: The maximum number of pages
|
|
|
|
* @pages: Where the resulting pages are placed
|
|
|
|
*
|
2017-09-07 07:21:21 +08:00
|
|
|
* find_get_pages_range() will search for and return a group of up to @nr_pages
|
|
|
|
* pages in the mapping starting at index @start and up to index @end
|
|
|
|
* (inclusive). The pages are placed at @pages. find_get_pages_range() takes
|
|
|
|
* a reference against the returned pages.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* The search returns a group of mapping-contiguous pages with ascending
|
|
|
|
* indexes. There may be holes in the indices due to not-present pages.
|
2017-09-07 07:21:18 +08:00
|
|
|
* We also update @start to index the next page for the traversal.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return: the number of pages which were found. If this number is
|
|
|
|
* smaller than @nr_pages, the end of specified range has been
|
2017-09-07 07:21:21 +08:00
|
|
|
* reached.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-09-07 07:21:21 +08:00
|
|
|
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
|
|
|
|
pgoff_t end, unsigned int nr_pages,
|
|
|
|
struct page **pages)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2018-05-17 05:38:56 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, *start);
|
|
|
|
struct page *page;
|
2012-03-29 05:42:54 +08:00
|
|
|
unsigned ret = 0;
|
|
|
|
|
|
|
|
if (unlikely(!nr_pages))
|
|
|
|
return 0;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-05-17 05:38:56 +08:00
|
|
|
xas_for_each(&xas, page, end) {
|
|
|
|
if (xas_retry(&xas, page))
|
2008-07-26 10:45:31 +08:00
|
|
|
continue;
|
2018-05-17 05:38:56 +08:00
|
|
|
/* Skip over shadow, swap and DAX entries */
|
|
|
|
if (xa_is_value(page))
|
2011-08-04 07:21:28 +08:00
|
|
|
continue;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 05:38:56 +08:00
|
|
|
goto retry;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Has the page moved or been split? */
|
2018-05-17 05:38:56 +08:00
|
|
|
if (unlikely(page != xas_reload(&xas)))
|
|
|
|
goto put_page;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
pages[ret] = find_subpage(page, xas.xa_index);
|
2017-09-07 07:21:21 +08:00
|
|
|
if (++ret == nr_pages) {
|
2019-03-06 07:49:17 +08:00
|
|
|
*start = xas.xa_index + 1;
|
2017-09-07 07:21:21 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2018-05-17 05:38:56 +08:00
|
|
|
continue;
|
|
|
|
put_page:
|
2019-09-24 06:34:52 +08:00
|
|
|
put_page(page);
|
2018-05-17 05:38:56 +08:00
|
|
|
retry:
|
|
|
|
xas_reset(&xas);
|
2008-07-26 10:45:31 +08:00
|
|
|
}
|
2011-03-23 07:33:07 +08:00
|
|
|
|
2017-09-07 07:21:21 +08:00
|
|
|
/*
|
|
|
|
* We come here when there is no page beyond @end. We take care to not
|
|
|
|
* overflow the index @start as it confuses some of the callers. This
|
2018-05-17 05:38:56 +08:00
|
|
|
* breaks the iteration when there is a page at index -1 but that is
|
2017-09-07 07:21:21 +08:00
|
|
|
* already broken anyway.
|
|
|
|
*/
|
|
|
|
if (end == (pgoff_t)-1)
|
|
|
|
*start = (pgoff_t)-1;
|
|
|
|
else
|
|
|
|
*start = end + 1;
|
|
|
|
out:
|
2008-07-26 10:45:31 +08:00
|
|
|
rcu_read_unlock();
|
2017-09-07 07:21:18 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2006-04-27 14:46:01 +08:00
|
|
|
/**
|
|
|
|
* find_get_pages_contig - gang contiguous pagecache lookup
|
|
|
|
* @mapping: The address_space to search
|
|
|
|
* @index: The starting page index
|
|
|
|
* @nr_pages: The maximum number of pages
|
|
|
|
* @pages: Where the resulting pages are placed
|
|
|
|
*
|
|
|
|
* find_get_pages_contig() works exactly like find_get_pages(), except
|
|
|
|
* that the returned number of pages are guaranteed to be contiguous.
|
|
|
|
*
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return: the number of pages which were found.
|
2006-04-27 14:46:01 +08:00
|
|
|
*/
|
|
|
|
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
|
|
|
|
unsigned int nr_pages, struct page **pages)
|
|
|
|
{
|
2018-05-17 06:00:33 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
|
|
|
struct page *page;
|
2012-03-29 05:42:54 +08:00
|
|
|
unsigned int ret = 0;
|
|
|
|
|
|
|
|
if (unlikely(!nr_pages))
|
|
|
|
return 0;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-05-17 06:00:33 +08:00
|
|
|
for (page = xas_load(&xas); page; page = xas_next(&xas)) {
|
|
|
|
if (xas_retry(&xas, page))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* If the entry has been swapped out, we can stop looking.
|
|
|
|
* No current caller is looking for DAX entries.
|
|
|
|
*/
|
|
|
|
if (xa_is_value(page))
|
2011-08-04 07:21:28 +08:00
|
|
|
break;
|
2006-04-27 14:46:01 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 06:00:33 +08:00
|
|
|
goto retry;
|
2016-07-27 06:26:04 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Has the page moved or been split? */
|
2018-05-17 06:00:33 +08:00
|
|
|
if (unlikely(page != xas_reload(&xas)))
|
|
|
|
goto put_page;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
pages[ret] = find_subpage(page, xas.xa_index);
|
2012-03-29 05:42:54 +08:00
|
|
|
if (++ret == nr_pages)
|
|
|
|
break;
|
2018-05-17 06:00:33 +08:00
|
|
|
continue;
|
|
|
|
put_page:
|
2019-09-24 06:34:52 +08:00
|
|
|
put_page(page);
|
2018-05-17 06:00:33 +08:00
|
|
|
retry:
|
|
|
|
xas_reset(&xas);
|
2006-04-27 14:46:01 +08:00
|
|
|
}
|
2008-07-26 10:45:31 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
2006-04-27 14:46:01 +08:00
|
|
|
}
|
2007-05-09 17:33:44 +08:00
|
|
|
EXPORT_SYMBOL(find_get_pages_contig);
|
2006-04-27 14:46:01 +08:00
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
2017-11-16 09:34:33 +08:00
|
|
|
* find_get_pages_range_tag - find and return pages in given range matching @tag
|
2006-06-23 17:03:49 +08:00
|
|
|
* @mapping: the address_space to search
|
|
|
|
* @index: the starting page index
|
2017-11-16 09:34:33 +08:00
|
|
|
* @end: The final page index (inclusive)
|
2006-06-23 17:03:49 +08:00
|
|
|
* @tag: the tag index
|
|
|
|
* @nr_pages: the maximum number of pages
|
|
|
|
* @pages: where the resulting pages are placed
|
|
|
|
*
|
2005-04-17 06:20:36 +08:00
|
|
|
* Like find_get_pages, except we only return pages which are tagged with
|
2006-06-23 17:03:49 +08:00
|
|
|
* @tag. We update @index to index the next page for the traversal.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: the number of pages which were found.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-11-16 09:34:33 +08:00
|
|
|
unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
|
2018-05-17 06:12:54 +08:00
|
|
|
pgoff_t end, xa_mark_t tag, unsigned int nr_pages,
|
2017-11-16 09:34:33 +08:00
|
|
|
struct page **pages)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2018-05-17 06:12:54 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, *index);
|
|
|
|
struct page *page;
|
2012-03-29 05:42:54 +08:00
|
|
|
unsigned ret = 0;
|
|
|
|
|
|
|
|
if (unlikely(!nr_pages))
|
|
|
|
return 0;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-05-17 06:12:54 +08:00
|
|
|
xas_for_each_marked(&xas, page, end, tag) {
|
|
|
|
if (xas_retry(&xas, page))
|
2008-07-26 10:45:31 +08:00
|
|
|
continue;
|
2018-05-17 06:12:54 +08:00
|
|
|
/*
|
|
|
|
* Shadow entries should never be tagged, but this iteration
|
|
|
|
* is lockless so there is a window for page reclaim to evict
|
|
|
|
* a page we saw tagged. Skip over it.
|
|
|
|
*/
|
|
|
|
if (xa_is_value(page))
|
2014-05-07 03:50:05 +08:00
|
|
|
continue;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 06:12:54 +08:00
|
|
|
goto retry;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Has the page moved or been split? */
|
2018-05-17 06:12:54 +08:00
|
|
|
if (unlikely(page != xas_reload(&xas)))
|
|
|
|
goto put_page;
|
2008-07-26 10:45:31 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
pages[ret] = find_subpage(page, xas.xa_index);
|
2017-11-16 09:34:33 +08:00
|
|
|
if (++ret == nr_pages) {
|
2019-03-06 07:49:17 +08:00
|
|
|
*index = xas.xa_index + 1;
|
2017-11-16 09:34:33 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2018-05-17 06:12:54 +08:00
|
|
|
continue;
|
|
|
|
put_page:
|
2019-09-24 06:34:52 +08:00
|
|
|
put_page(page);
|
2018-05-17 06:12:54 +08:00
|
|
|
retry:
|
|
|
|
xas_reset(&xas);
|
2008-07-26 10:45:31 +08:00
|
|
|
}
|
2011-03-23 07:33:07 +08:00
|
|
|
|
2017-11-16 09:34:33 +08:00
|
|
|
/*
|
2018-05-17 06:12:54 +08:00
|
|
|
* We come here when we got to @end. We take care to not overflow the
|
2017-11-16 09:34:33 +08:00
|
|
|
* index @index as it confuses some of the callers. This breaks the
|
2018-05-17 06:12:54 +08:00
|
|
|
* iteration when there is a page at index -1 but that is already
|
|
|
|
* broken anyway.
|
2017-11-16 09:34:33 +08:00
|
|
|
*/
|
|
|
|
if (end == (pgoff_t)-1)
|
|
|
|
*index = (pgoff_t)-1;
|
|
|
|
else
|
|
|
|
*index = end + 1;
|
|
|
|
out:
|
2008-07-26 10:45:31 +08:00
|
|
|
rcu_read_unlock();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2017-11-16 09:34:33 +08:00
|
|
|
EXPORT_SYMBOL(find_get_pages_range_tag);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
[PATCH] readahead: backoff on I/O error
Backoff readahead size exponentially on I/O error.
Michael Tokarev <mjt@tls.msk.ru> described the problem as:
[QUOTE]
Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one have to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.
But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successefully.
Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.
And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.
So, kernel currentry tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).
This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Ofcourse that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.
[/QUOTE]
Which was confirmed by Jens Axboe <axboe@suse.de>:
[QUOTE]
For ide-cd, it tends do only end the first part of the request on a
medium error. So you may see a lot of repeats :/
[/QUOTE]
With this patch, retries are expected to be reduced from, say, 256, to 5.
[akpm@osdl.org: cleanups]
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 20:48:43 +08:00
|
|
|
/*
|
|
|
|
* CD/DVDs are error prone. When a medium error occurs, the driver may fail
|
|
|
|
* a _large_ part of the i/o request. Imagine the worst scenario:
|
|
|
|
*
|
|
|
|
* ---R__________________________________________B__________
|
|
|
|
* ^ reading here ^ bad block(assume 4k)
|
|
|
|
*
|
|
|
|
* read(R) => miss => readahead(R...B) => media error => frustrating retries
|
|
|
|
* => failing the whole request => read(R) => read(R+1) =>
|
|
|
|
* readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
|
|
|
|
* readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
|
|
|
|
* readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
|
|
|
|
*
|
|
|
|
* It is going insane. Fix it by quickly scaling down the readahead size.
|
|
|
|
*/
|
2020-04-02 12:04:50 +08:00
|
|
|
static void shrink_readahead_size_eio(struct file_ra_state *ra)
|
[PATCH] readahead: backoff on I/O error
Backoff readahead size exponentially on I/O error.
Michael Tokarev <mjt@tls.msk.ru> described the problem as:
[QUOTE]
Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one have to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.
But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successefully.
Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.
And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.
So, kernel currentry tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).
This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Ofcourse that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.
[/QUOTE]
Which was confirmed by Jens Axboe <axboe@suse.de>:
[QUOTE]
For ide-cd, it tends do only end the first part of the request on a
medium error. So you may see a lot of repeats :/
[/QUOTE]
With this patch, retries are expected to be reduced from, say, 256, to 5.
[akpm@osdl.org: cleanups]
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 20:48:43 +08:00
|
|
|
{
|
|
|
|
ra->ra_pages /= 4;
|
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
2017-08-29 22:13:18 +08:00
|
|
|
* generic_file_buffered_read - generic file read routine
|
|
|
|
* @iocb: the iocb to read
|
2014-02-04 06:07:03 +08:00
|
|
|
* @iter: data destination
|
|
|
|
* @written: already copied
|
2006-06-23 17:03:49 +08:00
|
|
|
*
|
2005-04-17 06:20:36 +08:00
|
|
|
* This is a generic file read routine, and uses the
|
2006-06-23 17:03:49 +08:00
|
|
|
* mapping->a_ops->readpage() function for the actual low-level stuff.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* This is really ugly. But the goto's actually try to clarify some
|
|
|
|
* of the logic when it comes to error handling etc.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* * total number of bytes copied, including those the were already @written
|
|
|
|
* * negative error code if nothing was copied
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2019-08-31 01:09:24 +08:00
|
|
|
ssize_t generic_file_buffered_read(struct kiocb *iocb,
|
2014-02-04 06:07:03 +08:00
|
|
|
struct iov_iter *iter, ssize_t written)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-08-29 22:13:18 +08:00
|
|
|
struct file *filp = iocb->ki_filp;
|
2008-02-08 20:21:24 +08:00
|
|
|
struct address_space *mapping = filp->f_mapping;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode = mapping->host;
|
2008-02-08 20:21:24 +08:00
|
|
|
struct file_ra_state *ra = &filp->f_ra;
|
2017-08-29 22:13:18 +08:00
|
|
|
loff_t *ppos = &iocb->ki_pos;
|
2007-10-16 16:24:37 +08:00
|
|
|
pgoff_t index;
|
|
|
|
pgoff_t last_index;
|
|
|
|
pgoff_t prev_index;
|
|
|
|
unsigned long offset; /* offset into pagecache page */
|
2007-05-07 05:49:25 +08:00
|
|
|
unsigned int prev_offset;
|
2014-02-04 06:07:03 +08:00
|
|
|
int error = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
vfs,mm: fix a dead loop in truncate_inode_pages_range()
We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:
...
fd = open();
write(fd, buf, 4096);
preadv64(fd, &iovec, 1, 0xffffffff000);
ftruncate(fd, 0);
...
Then ftruncate() will not return forever.
The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.
When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping. Then
this page can be found in ->mapping with pagevec_lookup(). After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:
- find a page with index=0xffffffff, since index>=end, this page won't
be truncated
- index++, and index become 0
- the page with index=0xffffffff will be found again
The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.
Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:
INFO: task truncate_test:3364 blocked for more than 120 seconds.
...
call_rwsem_down_write_failed+0x17/0x30
generic_file_write_iter+0x32/0x1c0
ubifs_write_iter+0xcc/0x170
__vfs_write+0xc4/0x120
vfs_write+0xb2/0x1b0
SyS_write+0x46/0xa0
The page with index=0xffffffff added to ->mapping is useless. Fix this
by checking the read position before allocating pages.
Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 08:01:52 +08:00
|
|
|
if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
|
2016-12-15 04:45:25 +08:00
|
|
|
return 0;
|
vfs,mm: fix a dead loop in truncate_inode_pages_range()
We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:
...
fd = open();
write(fd, buf, 4096);
preadv64(fd, &iovec, 1, 0xffffffff000);
ftruncate(fd, 0);
...
Then ftruncate() will not return forever.
The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.
When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping. Then
this page can be found in ->mapping with pagevec_lookup(). After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:
- find a page with index=0xffffffff, since index>=end, this page won't
be truncated
- index++, and index become 0
- the page with index=0xffffffff will be found again
The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.
Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:
INFO: task truncate_test:3364 blocked for more than 120 seconds.
...
call_rwsem_down_write_failed+0x17/0x30
generic_file_write_iter+0x32/0x1c0
ubifs_write_iter+0xcc/0x170
__vfs_write+0xc4/0x120
vfs_write+0xb2/0x1b0
SyS_write+0x46/0xa0
The page with index=0xffffffff added to ->mapping is useless. Fix this
by checking the read position before allocating pages.
Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 08:01:52 +08:00
|
|
|
iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
|
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
index = *ppos >> PAGE_SHIFT;
|
|
|
|
prev_index = ra->prev_pos >> PAGE_SHIFT;
|
|
|
|
prev_offset = ra->prev_pos & (PAGE_SIZE-1);
|
|
|
|
last_index = (*ppos + iter->count + PAGE_SIZE-1) >> PAGE_SHIFT;
|
|
|
|
offset = *ppos & ~PAGE_MASK;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
struct page *page;
|
2007-10-16 16:24:37 +08:00
|
|
|
pgoff_t end_index;
|
2007-07-17 19:03:04 +08:00
|
|
|
loff_t isize;
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned long nr, ret;
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
find_page:
|
2017-02-04 05:13:29 +08:00
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
error = -EINTR;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
page = find_get_page(mapping, index);
|
2007-07-19 16:48:02 +08:00
|
|
|
if (!page) {
|
2017-08-29 22:13:19 +08:00
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT)
|
|
|
|
goto would_block;
|
2007-07-19 16:48:08 +08:00
|
|
|
page_cache_sync_readahead(mapping,
|
2007-10-16 16:24:35 +08:00
|
|
|
ra, filp,
|
2007-07-19 16:48:02 +08:00
|
|
|
index, last_index - index);
|
|
|
|
page = find_get_page(mapping, index);
|
|
|
|
if (unlikely(page == NULL))
|
|
|
|
goto no_cached_page;
|
|
|
|
}
|
|
|
|
if (PageReadahead(page)) {
|
2007-07-19 16:48:08 +08:00
|
|
|
page_cache_async_readahead(mapping,
|
2007-10-16 16:24:35 +08:00
|
|
|
ra, filp, page,
|
2007-07-19 16:48:02 +08:00
|
|
|
index, last_index - index);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
if (!PageUptodate(page)) {
|
2017-08-29 22:13:19 +08:00
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT) {
|
|
|
|
put_page(page);
|
|
|
|
goto would_block;
|
|
|
|
}
|
|
|
|
|
2016-03-16 05:55:39 +08:00
|
|
|
/*
|
|
|
|
* See comment in do_read_cache_page on why
|
|
|
|
* wait_on_page_locked is used to avoid unnecessarily
|
|
|
|
* serialisations and why it's safe.
|
|
|
|
*/
|
2016-10-08 07:58:33 +08:00
|
|
|
error = wait_on_page_locked_killable(page);
|
|
|
|
if (unlikely(error))
|
|
|
|
goto readpage_error;
|
2016-03-16 05:55:39 +08:00
|
|
|
if (PageUptodate(page))
|
|
|
|
goto page_ok;
|
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
if (inode->i_blkbits == PAGE_SHIFT ||
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
!mapping->a_ops->is_partially_uptodate)
|
|
|
|
goto page_not_up_to_date;
|
mm/filemap: don't allow partially uptodate page for pipes
Starting from 4.9-rc1 kernel, I started noticing some test failures
of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
testing on sub-page block size filesystems (tested both XFS and
ext4), these syscalls start to return EIO in the tests. e.g.
sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
This is because that in sub-page block size cases, we don't need the
whole page to be uptodate, only the part we care about is uptodate
is OK (if fs has ->is_partially_uptodate defined). But
page_cache_pipe_buf_confirm() doesn't have the ability to check the
partially-uptodate case, it needs the whole page to be uptodate. So
it returns EIO in this case.
This is a regression introduced by commit 82c156f85384 ("switch
generic_file_splice_read() to use of ->read_iter()"). Prior to the
change, generic_file_splice_read() doesn't allow partially-uptodate
page either, so it worked fine.
Fix it by skipping the partially-uptodate check if we're working on
a pipe in do_generic_file_read(), so we read the whole page from
disk as long as the page is not uptodate.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-01 15:43:07 +08:00
|
|
|
/* pipes can't handle partially uptodate pages */
|
2018-10-22 20:07:28 +08:00
|
|
|
if (unlikely(iov_iter_is_pipe(iter)))
|
mm/filemap: don't allow partially uptodate page for pipes
Starting from 4.9-rc1 kernel, I started noticing some test failures
of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
testing on sub-page block size filesystems (tested both XFS and
ext4), these syscalls start to return EIO in the tests. e.g.
sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
This is because that in sub-page block size cases, we don't need the
whole page to be uptodate, only the part we care about is uptodate
is OK (if fs has ->is_partially_uptodate defined). But
page_cache_pipe_buf_confirm() doesn't have the ability to check the
partially-uptodate case, it needs the whole page to be uptodate. So
it returns EIO in this case.
This is a regression introduced by commit 82c156f85384 ("switch
generic_file_splice_read() to use of ->read_iter()"). Prior to the
change, generic_file_splice_read() doesn't allow partially-uptodate
page either, so it worked fine.
Fix it by skipping the partially-uptodate check if we're working on
a pipe in do_generic_file_read(), so we read the whole page from
disk as long as the page is not uptodate.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-01 15:43:07 +08:00
|
|
|
goto page_not_up_to_date;
|
2008-08-02 18:01:03 +08:00
|
|
|
if (!trylock_page(page))
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
goto page_not_up_to_date;
|
mm/vfs: revalidate page->mapping in do_generic_file_read()
70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
ran into a NULL dereference in here:
int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
unsigned long from)
{
----> struct inode *inode = page->mapping->host;
It looks like page->mapping was the culprit. (xmon trace is below).
After closer examination, I realized that do_generic_file_read() does a
find_get_page(), and eventually locks the page before calling
block_is_partially_uptodate(). However, it doesn't revalidate the
page->mapping after the page is locked. So, there's a small window
between the find_get_page() and ->is_partially_uptodate() where the page
could get truncated and page->mapping cleared.
We _have_ a reference, so it can't get reclaimed, but it certainly
can be truncated.
I think the correct thing is to check page->mapping after the
trylock_page(), and jump out if it got truncated. This patch has been
running in the test environment for a month or so now, and we have not
seen this bug pop up again.
xmon info:
1f:mon> e
cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
sp: c0000002ae36f9f0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc000000378f99e30
paca = 0xc000000000f66300
pid = 21946, comm = bash
1f:mon> r
R00 = 0025c0500000006d R16 = 0000000000000000
R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
R02 = c000000000e8cd80 R18 = ffffffffffffffff
R03 = c0000000031d0f88 R19 = 0000000000000001
R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
R05 = 0000000000000000 R21 = c0000002ae36fa68
R06 = 0000000000000000 R22 = 0000000000000000
R07 = 0000000000000001 R23 = c0000002ae36fbb0
R08 = 0000000000000002 R24 = 0000000000000000
R09 = 0000000000000000 R25 = c000000362cd3a80
R10 = 0000000000000000 R26 = 0000000000000002
R11 = c0000000001e7b60 R27 = 0000000000000000
R12 = 0000000042000484 R28 = 0000000000000001
R13 = c000000000f66300 R29 = c0000003bb97b9b8
R14 = 0000000000000001 R30 = c000000000e28a08
R15 = 000000000000ffff R31 = c0000000031d0f88
pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
msr = 8000000000009032 cr = 22000488
ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
1f:mon> t
[link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
[c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
[c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
[c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
[c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
[c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 00000080a840bc54
SP (fffca15df30) is in userspace
1f:mon> di c0000000001e7a6c
c0000000001e7a6c e9290000 ld r9,0(r9)
c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
c0000000001e7a74 e9440008 ld r10,8(r4)
c0000000001e7a78 78a80020 clrldi r8,r5,32
c0000000001e7a7c 3c000001 lis r0,1
c0000000001e7a80 812900a8 lwz r9,168(r9)
c0000000001e7a84 39600001 li r11,1
c0000000001e7a88 7c080050 subf r0,r8,r0
c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
c0000000001e7a90 7d6b4830 slw r11,r11,r9
c0000000001e7a94 796b0020 clrldi r11,r11,32
c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
c0000000001e7aa0 7d004214 add r8,r0,r8
c0000000001e7aa4 79080020 clrldi r8,r8,32
c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <arunabal@in.ibm.com>
Cc: <sbest@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 06:05:15 +08:00
|
|
|
/* Did it get truncated before we got the lock? */
|
|
|
|
if (!page->mapping)
|
|
|
|
goto page_not_up_to_date_locked;
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
if (!mapping->a_ops->is_partially_uptodate(page,
|
2014-02-04 06:07:03 +08:00
|
|
|
offset, iter->count))
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
goto page_not_up_to_date_locked;
|
|
|
|
unlock_page(page);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
page_ok:
|
2007-07-17 19:03:04 +08:00
|
|
|
/*
|
|
|
|
* i_size must be checked after we know the page is Uptodate.
|
|
|
|
*
|
|
|
|
* Checking i_size after the check allows us to calculate
|
|
|
|
* the correct value for "nr", which means the zero-filled
|
|
|
|
* part of the page is not copied back to userspace (unless
|
|
|
|
* another truncate extends the file - this is desired though).
|
|
|
|
*/
|
|
|
|
|
|
|
|
isize = i_size_read(inode);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
end_index = (isize - 1) >> PAGE_SHIFT;
|
2007-07-17 19:03:04 +08:00
|
|
|
if (unlikely(!isize || index > end_index)) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2007-07-17 19:03:04 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* nr is the maximum number of bytes to copy from this page */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
nr = PAGE_SIZE;
|
2007-07-17 19:03:04 +08:00
|
|
|
if (index == end_index) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
nr = ((isize - 1) & ~PAGE_MASK) + 1;
|
2007-07-17 19:03:04 +08:00
|
|
|
if (nr <= offset) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2007-07-17 19:03:04 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
nr = nr - offset;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* If users can be writing to this page using arbitrary
|
|
|
|
* virtual addresses, take care about potential aliasing
|
|
|
|
* before reading the page on the kernel side.
|
|
|
|
*/
|
|
|
|
if (mapping_writably_mapped(mapping))
|
|
|
|
flush_dcache_page(page);
|
|
|
|
|
|
|
|
/*
|
2007-05-07 05:49:25 +08:00
|
|
|
* When a sequential read accesses a page several times,
|
|
|
|
* only mark it as accessed the first time.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2007-05-07 05:49:25 +08:00
|
|
|
if (prev_index != index || offset != prev_offset)
|
2005-04-17 06:20:36 +08:00
|
|
|
mark_page_accessed(page);
|
|
|
|
prev_index = index;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, we have the page, and it's up-to-date, so
|
|
|
|
* now we can copy it to user space...
|
|
|
|
*/
|
2014-02-04 06:07:03 +08:00
|
|
|
|
|
|
|
ret = copy_page_to_iter(page, offset, nr, iter);
|
2005-04-17 06:20:36 +08:00
|
|
|
offset += ret;
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
index += offset >> PAGE_SHIFT;
|
|
|
|
offset &= ~PAGE_MASK;
|
2007-05-07 05:49:26 +08:00
|
|
|
prev_offset = offset;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-02-04 06:07:03 +08:00
|
|
|
written += ret;
|
|
|
|
if (!iov_iter_count(iter))
|
|
|
|
goto out;
|
|
|
|
if (ret < nr) {
|
|
|
|
error = -EFAULT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
continue;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
page_not_up_to_date:
|
|
|
|
/* Get exclusive access to the page ... */
|
do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.
This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.
However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.
Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.
Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-09 01:20:43 +08:00
|
|
|
error = lock_page_killable(page);
|
|
|
|
if (unlikely(error))
|
|
|
|
goto readpage_error;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
vfs: pagecache usage optimization for pagesize!=blocksize
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 06:46:36 +08:00
|
|
|
page_not_up_to_date_locked:
|
2006-09-26 14:31:35 +08:00
|
|
|
/* Did it get truncated before we got the lock? */
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!page->mapping) {
|
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Did somebody else fill it already? */
|
|
|
|
if (PageUptodate(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
goto page_ok;
|
|
|
|
}
|
|
|
|
|
|
|
|
readpage:
|
2010-05-26 23:49:40 +08:00
|
|
|
/*
|
|
|
|
* A previous I/O error may have been due to temporary
|
|
|
|
* failures, eg. multipath errors.
|
|
|
|
* PG_error will be set again if readpage fails.
|
|
|
|
*/
|
|
|
|
ClearPageError(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Start the actual read. The read will unlock the page. */
|
|
|
|
error = mapping->a_ops->readpage(filp, page);
|
|
|
|
|
2005-12-16 06:28:17 +08:00
|
|
|
if (unlikely(error)) {
|
|
|
|
if (error == AOP_TRUNCATED_PAGE) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-02-04 06:07:03 +08:00
|
|
|
error = 0;
|
2005-12-16 06:28:17 +08:00
|
|
|
goto find_page;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
goto readpage_error;
|
2005-12-16 06:28:17 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (!PageUptodate(page)) {
|
do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.
This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.
However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.
Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.
Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-09 01:20:43 +08:00
|
|
|
error = lock_page_killable(page);
|
|
|
|
if (unlikely(error))
|
|
|
|
goto readpage_error;
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!PageUptodate(page)) {
|
|
|
|
if (page->mapping == NULL) {
|
|
|
|
/*
|
2010-01-27 00:27:20 +08:00
|
|
|
* invalidate_mapping_pages got it
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
goto find_page;
|
|
|
|
}
|
|
|
|
unlock_page(page);
|
2020-04-02 12:04:50 +08:00
|
|
|
shrink_readahead_size_eio(ra);
|
do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.
This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.
However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.
Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.
Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-09 01:20:43 +08:00
|
|
|
error = -EIO;
|
|
|
|
goto readpage_error;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
unlock_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
goto page_ok;
|
|
|
|
|
|
|
|
readpage_error:
|
|
|
|
/* UHHUH! A synchronous read error occurred. Report it */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
no_cached_page:
|
|
|
|
/*
|
|
|
|
* Ok, it wasn't cached, so we need to create a new
|
|
|
|
* page..
|
|
|
|
*/
|
2017-11-16 09:38:03 +08:00
|
|
|
page = page_cache_alloc(mapping);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (!page) {
|
2014-02-04 06:07:03 +08:00
|
|
|
error = -ENOMEM;
|
2007-10-16 16:24:57 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2015-06-25 07:58:06 +08:00
|
|
|
error = add_to_page_cache_lru(page, mapping, index,
|
2015-11-07 08:28:49 +08:00
|
|
|
mapping_gfp_constraint(mapping, GFP_KERNEL));
|
2005-04-17 06:20:36 +08:00
|
|
|
if (error) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-02-04 06:07:03 +08:00
|
|
|
if (error == -EEXIST) {
|
|
|
|
error = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto find_page;
|
2014-02-04 06:07:03 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
goto readpage;
|
|
|
|
}
|
|
|
|
|
2017-08-29 22:13:19 +08:00
|
|
|
would_block:
|
|
|
|
error = -EAGAIN;
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
2007-10-16 16:24:35 +08:00
|
|
|
ra->prev_pos = prev_index;
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
ra->prev_pos <<= PAGE_SHIFT;
|
2007-10-16 16:24:35 +08:00
|
|
|
ra->prev_pos |= prev_offset;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
|
2008-10-16 13:01:13 +08:00
|
|
|
file_accessed(filp);
|
2014-02-04 06:07:03 +08:00
|
|
|
return written ? written : error;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2019-08-31 01:09:24 +08:00
|
|
|
EXPORT_SYMBOL_GPL(generic_file_buffered_read);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
2014-04-05 02:20:57 +08:00
|
|
|
* generic_file_read_iter - generic filesystem read routine
|
2006-06-23 17:03:49 +08:00
|
|
|
* @iocb: kernel I/O control block
|
2014-04-05 02:20:57 +08:00
|
|
|
* @iter: destination for the data read
|
2006-06-23 17:03:49 +08:00
|
|
|
*
|
2014-04-05 02:20:57 +08:00
|
|
|
* This is the "read_iter()" routine for all filesystems
|
2005-04-17 06:20:36 +08:00
|
|
|
* that can use the page cache directly.
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return:
|
|
|
|
* * number of bytes copied, even for partial reads
|
|
|
|
* * negative error code if nothing was read
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
ssize_t
|
2014-03-06 11:53:04 +08:00
|
|
|
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.
Since roundup_pow_of_two(0) is undefined, UBSAN reports
UBSAN: Undefined behaviour in include/linux/log2.h:63:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
[...]
Call Trace:
[...]
[<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
[<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
[<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
[<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
[<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
[<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
[...]
[<ffffffff81510b06>] __vfs_read+0x256/0x3d0
[...]
when get_init_ra_size() gets called from ondemand_readahead().
The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since
1UL << (sizeof(unsigned long) * 8)
evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.
What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().
According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-26 05:22:14 +08:00
|
|
|
size_t count = iov_iter_count(iter);
|
2017-08-29 22:13:18 +08:00
|
|
|
ssize_t retval = 0;
|
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.
Since roundup_pow_of_two(0) is undefined, UBSAN reports
UBSAN: Undefined behaviour in include/linux/log2.h:63:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
[...]
Call Trace:
[...]
[<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
[<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
[<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
[<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
[<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
[<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
[...]
[<ffffffff81510b06>] __vfs_read+0x256/0x3d0
[...]
when get_init_ra_size() gets called from ondemand_readahead().
The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since
1UL << (sizeof(unsigned long) * 8)
evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.
What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().
According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-26 05:22:14 +08:00
|
|
|
|
|
|
|
if (!count)
|
|
|
|
goto out; /* skip atime */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2015-04-10 01:52:01 +08:00
|
|
|
if (iocb->ki_flags & IOCB_DIRECT) {
|
2017-08-29 22:13:18 +08:00
|
|
|
struct file *file = iocb->ki_filp;
|
2014-03-06 11:53:04 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
struct inode *inode = mapping->host;
|
2006-10-01 14:28:48 +08:00
|
|
|
loff_t size;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
size = i_size_read(inode);
|
2017-06-20 20:05:44 +08:00
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT) {
|
|
|
|
if (filemap_range_has_page(mapping, iocb->ki_pos,
|
|
|
|
iocb->ki_pos + count - 1))
|
|
|
|
return -EAGAIN;
|
|
|
|
} else {
|
|
|
|
retval = filemap_write_and_wait_range(mapping,
|
|
|
|
iocb->ki_pos,
|
|
|
|
iocb->ki_pos + count - 1);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2014-03-05 10:27:34 +08:00
|
|
|
|
2016-10-03 06:48:08 +08:00
|
|
|
file_accessed(file);
|
|
|
|
|
2017-04-14 02:13:36 +08:00
|
|
|
retval = mapping->a_ops->direct_IO(iocb, iter);
|
2016-10-11 01:26:27 +08:00
|
|
|
if (retval >= 0) {
|
2016-04-07 23:51:55 +08:00
|
|
|
iocb->ki_pos += retval;
|
2017-04-14 02:13:36 +08:00
|
|
|
count -= retval;
|
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 22:42:22 +08:00
|
|
|
}
|
2017-05-09 01:54:47 +08:00
|
|
|
iov_iter_revert(iter, count - iov_iter_count(iter));
|
2010-05-23 23:00:54 +08:00
|
|
|
|
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 22:42:22 +08:00
|
|
|
/*
|
|
|
|
* Btrfs can have a short DIO read if we encounter
|
|
|
|
* compressed extents, so if there was an error, or if
|
|
|
|
* we've already read everything we wanted to, or if
|
|
|
|
* there was a short read because we hit EOF, go ahead
|
|
|
|
* and return. Otherwise fallthrough to buffered io for
|
2015-02-17 07:58:53 +08:00
|
|
|
* the rest of the read. Buffered reads will not work for
|
|
|
|
* DAX files, so don't bother trying.
|
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 22:42:22 +08:00
|
|
|
*/
|
2017-04-14 02:13:36 +08:00
|
|
|
if (retval < 0 || !count || iocb->ki_pos >= size ||
|
2016-10-03 06:48:08 +08:00
|
|
|
IS_DAX(inode))
|
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 22:42:22 +08:00
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2017-08-29 22:13:18 +08:00
|
|
|
retval = generic_file_buffered_read(iocb, iter, retval);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
return retval;
|
|
|
|
}
|
2014-03-06 11:53:04 +08:00
|
|
|
EXPORT_SYMBOL(generic_file_read_iter);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
#define MMAP_LOTSAMISS (100)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
/*
|
|
|
|
* lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_sem
|
|
|
|
* @vmf - the vm_fault for this fault.
|
|
|
|
* @page - the page to lock.
|
|
|
|
* @fpin - the pointer to the file we may pin (or is already pinned).
|
|
|
|
*
|
|
|
|
* This works similar to lock_page_or_retry in that it can drop the mmap_sem.
|
|
|
|
* It differs in that it actually returns the page locked if it returns 1 and 0
|
|
|
|
* if it couldn't lock the page. If we did have to drop the mmap_sem then fpin
|
|
|
|
* will point to the pinned file and needs to be fput()'ed at a later point.
|
|
|
|
*/
|
|
|
|
static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
|
|
|
|
struct file **fpin)
|
|
|
|
{
|
|
|
|
if (trylock_page(page))
|
|
|
|
return 1;
|
|
|
|
|
2019-03-16 02:26:07 +08:00
|
|
|
/*
|
|
|
|
* NOTE! This will make us return with VM_FAULT_RETRY, but with
|
|
|
|
* the mmap_sem still held. That's how FAULT_FLAG_RETRY_NOWAIT
|
|
|
|
* is supposed to work. We have way too many special cases..
|
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
|
|
|
|
if (vmf->flags & FAULT_FLAG_KILLABLE) {
|
|
|
|
if (__lock_page_killable(page)) {
|
|
|
|
/*
|
|
|
|
* We didn't have the right flags to drop the mmap_sem,
|
|
|
|
* but all fault_handlers only check for fatal signals
|
|
|
|
* if we return VM_FAULT_RETRY, so we need to drop the
|
|
|
|
* mmap_sem here and return 0 if we don't have a fpin.
|
|
|
|
*/
|
|
|
|
if (*fpin == NULL)
|
|
|
|
up_read(&vmf->vma->vm_mm->mmap_sem);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
__lock_page(page);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-06-17 06:31:25 +08:00
|
|
|
/*
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
* Synchronous readahead happens when we don't even find a page in the page
|
|
|
|
* cache at all. We don't want to perform IO under the mmap sem, so if we have
|
|
|
|
* to drop the mmap sem we return the file that was pinned in order for us to do
|
|
|
|
* that. If we didn't pin a file then we return NULL. The file that is
|
|
|
|
* returned needs to be fput()'ed when we're done with it.
|
2009-06-17 06:31:25 +08:00
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
|
2009-06-17 06:31:25 +08:00
|
|
|
{
|
2019-03-14 02:44:18 +08:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
|
|
|
struct file_ra_state *ra = &file->f_ra;
|
2009-06-17 06:31:25 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
struct file *fpin = NULL;
|
2019-03-14 02:44:18 +08:00
|
|
|
pgoff_t offset = vmf->pgoff;
|
2009-06-17 06:31:25 +08:00
|
|
|
|
|
|
|
/* If we don't want any read-ahead, don't bother */
|
2019-03-14 02:44:18 +08:00
|
|
|
if (vmf->vma->vm_flags & VM_RAND_READ)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2011-05-25 08:12:28 +08:00
|
|
|
if (!ra->ra_pages)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
|
2019-03-14 02:44:18 +08:00
|
|
|
if (vmf->vma->vm_flags & VM_SEQ_READ) {
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
readahead: enforce full sync mmap readahead size
Now that we do readahead for sequential mmap reads, here is a simple
evaluation of the impacts, and one further optimization.
It's an NFS-root debian desktop system, readahead size = 60 pages.
The numbers are grabbed after a fresh boot into console.
approach pgmajfault RA miss ratio mmap IO count avg IO size(pages)
A 383 31.6% 383 11
B 225 32.4% 390 11
C 224 32.6% 307 13
case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus mmap readahead
C = B plus this patch
The numbers show that
- there are good possibilities for random mmap reads to trigger readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce IO count by 1/4
- readahead miss ratios are not quite affected
The theory is
- readahead is _good_ for clustered random reads, and can perform
_better_ than readaround because they could be _async_.
- async readahead size is guaranteed to be larger than readaround
size, and they are _async_, hence will mostly behave better
However for B
- sync readahead size could be smaller than readaround size, hence may
make things worse by produce more smaller IOs
which will be fixed by this patch.
Final conclusion:
- mmap readahead reduced major faults by 1/3 and no obvious overheads;
- mmap io can be further reduced by 1/4 with this patch.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:38 +08:00
|
|
|
page_cache_sync_readahead(mapping, ra, file, offset,
|
|
|
|
ra->ra_pages);
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
}
|
|
|
|
|
2011-05-25 08:12:29 +08:00
|
|
|
/* Avoid banging the cache line if not needed */
|
|
|
|
if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
|
2009-06-17 06:31:25 +08:00
|
|
|
ra->mmap_miss++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do we miss much more than hit in this file? If so,
|
|
|
|
* stop bothering with read-ahead. It will only hurt.
|
|
|
|
*/
|
|
|
|
if (ra->mmap_miss > MMAP_LOTSAMISS)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
|
2009-06-17 06:31:30 +08:00
|
|
|
/*
|
|
|
|
* mmap read-around
|
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
2015-11-06 10:47:08 +08:00
|
|
|
ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
|
|
|
|
ra->size = ra->ra_pages;
|
|
|
|
ra->async_size = ra->ra_pages / 4;
|
2011-05-25 08:12:28 +08:00
|
|
|
ra_submit(ra, mapping, file);
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Asynchronous readahead happens when we find the page and PG_readahead,
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
* so we want to possibly extend the readahead further. We return the file that
|
|
|
|
* was pinned if we have to drop the mmap_sem in order to do IO.
|
2009-06-17 06:31:25 +08:00
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
|
|
|
|
struct page *page)
|
2009-06-17 06:31:25 +08:00
|
|
|
{
|
2019-03-14 02:44:18 +08:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
|
|
|
struct file_ra_state *ra = &file->f_ra;
|
2009-06-17 06:31:25 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
struct file *fpin = NULL;
|
2019-03-14 02:44:18 +08:00
|
|
|
pgoff_t offset = vmf->pgoff;
|
2009-06-17 06:31:25 +08:00
|
|
|
|
|
|
|
/* If we don't want any read-ahead, don't bother */
|
2020-04-02 12:04:40 +08:00
|
|
|
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
if (ra->mmap_miss > 0)
|
|
|
|
ra->mmap_miss--;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
if (PageReadahead(page)) {
|
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
readahead: enforce full readahead size on async mmap readahead
We need this in one particular case and two more general ones.
Now we do async readahead for sequential mmap reads, and do it with the
help of PG_readahead. For normal reads, PG_readahead is the sufficient
condition to do a sequential readahead. But unfortunately, for mmap
reads, there is a tiny nuisance:
[11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
[11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
[11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
[11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~
An unfavorably small readahead. The original dumb read-around size could
be more efficient.
That happened because ld-linux.so does a read(832) in L1 before mmap(),
which triggers a 4-page readahead, with the second page tagged
PG_readahead.
L0: open("/lib/libc.so.6", O_RDONLY) = 3
L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
L7: close(3) = 0
In general, the PG_readahead flag will also be hit in cases
- sequential reads
- clustered random reads
A full readahead size is desirable in both cases.
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:29 +08:00
|
|
|
page_cache_async_readahead(mapping, ra, file,
|
|
|
|
page, offset, ra->ra_pages);
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
}
|
|
|
|
return fpin;
|
2009-06-17 06:31:25 +08:00
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:49 +08:00
|
|
|
/**
|
2007-07-19 16:46:59 +08:00
|
|
|
* filemap_fault - read in file data for page fault handling
|
2007-07-19 16:47:03 +08:00
|
|
|
* @vmf: struct vm_fault containing details of the fault
|
2006-06-23 17:03:49 +08:00
|
|
|
*
|
2007-07-19 16:46:59 +08:00
|
|
|
* filemap_fault() is invoked via the vma operations vector for a
|
2005-04-17 06:20:36 +08:00
|
|
|
* mapped memory region to read in file data during a page fault.
|
|
|
|
*
|
|
|
|
* The goto's are kind of ugly, but this streamlines the normal case of having
|
|
|
|
* it in the page cache, and handles the special cases reasonably without
|
|
|
|
* having a lot of duplicated code.
|
2014-08-07 07:07:24 +08:00
|
|
|
*
|
|
|
|
* vma->vm_mm->mmap_sem must be held on entry.
|
|
|
|
*
|
2019-07-12 11:55:29 +08:00
|
|
|
* If our return value has VM_FAULT_RETRY set, it's because the mmap_sem
|
|
|
|
* may be dropped before doing I/O or by lock_page_maybe_drop_mmap().
|
2014-08-07 07:07:24 +08:00
|
|
|
*
|
|
|
|
* If our return value does not have VM_FAULT_RETRY set, the mmap_sem
|
|
|
|
* has not been released.
|
|
|
|
*
|
|
|
|
* We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: bitwise-OR of %VM_FAULT_ codes.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2018-06-08 08:08:00 +08:00
|
|
|
vm_fault_t filemap_fault(struct vm_fault *vmf)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int error;
|
2017-02-25 06:56:41 +08:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
struct file *fpin = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
struct file_ra_state *ra = &file->f_ra;
|
|
|
|
struct inode *inode = mapping->host;
|
2009-06-17 06:31:25 +08:00
|
|
|
pgoff_t offset = vmf->pgoff;
|
mm: tighten up the fault path a little
The round_up() macro generates a couple of unnecessary instructions
in this usage:
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 83 e8 01 sub $0x1,%rax
48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
48db: 48 83 c0 01 add $0x1,%rax
48df: 48 c1 f8 0c sar $0xc,%rax
48e3: 48 39 c3 cmp %rax,%rbx
48e6: 72 2e jb 4916 <filemap_fault+0x96>
If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
then GCC can see through it and remove the mask (because that would be
dead code given the subsequent shift):
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
48d7: 48 c1 e8 0c shr $0xc,%rax
48db: 48 39 c3 cmp %rax,%rbx
48de: 72 2e jb 490e <filemap_fault+0x8e>
But that's problematic because we'd evaluate 'y' twice. Converting
round_up into an inline function prevents it from being used in other
definitions. The easiest thing to do is just change these three usages
of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
heuristic is wrong in this case.
Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:29 +08:00
|
|
|
pgoff_t max_off;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct page *page;
|
2018-06-08 08:08:00 +08:00
|
|
|
vm_fault_t ret = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
mm: tighten up the fault path a little
The round_up() macro generates a couple of unnecessary instructions
in this usage:
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 83 e8 01 sub $0x1,%rax
48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
48db: 48 83 c0 01 add $0x1,%rax
48df: 48 c1 f8 0c sar $0xc,%rax
48e3: 48 39 c3 cmp %rax,%rbx
48e6: 72 2e jb 4916 <filemap_fault+0x96>
If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
then GCC can see through it and remove the mask (because that would be
dead code given the subsequent shift):
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
48d7: 48 c1 e8 0c shr $0xc,%rax
48db: 48 39 c3 cmp %rax,%rbx
48de: 72 2e jb 490e <filemap_fault+0x8e>
But that's problematic because we'd evaluate 'y' twice. Converting
round_up into an inline function prevents it from being used in other
definitions. The easiest thing to do is just change these three usages
of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
heuristic is wrong in this case.
Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:29 +08:00
|
|
|
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
|
|
|
|
if (unlikely(offset >= max_off))
|
2007-11-01 00:19:46 +08:00
|
|
|
return VM_FAULT_SIGBUS;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
2013-10-17 04:46:59 +08:00
|
|
|
* Do we have something in the page cache already?
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2009-06-17 06:31:25 +08:00
|
|
|
page = find_get_page(mapping, offset);
|
2012-10-09 07:32:19 +08:00
|
|
|
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2009-06-17 06:31:25 +08:00
|
|
|
* We found the page, so try async readahead before
|
|
|
|
* waiting for the lock.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
fpin = do_async_mmap_readahead(vmf, page);
|
2012-10-09 07:32:19 +08:00
|
|
|
} else if (!page) {
|
2009-06-17 06:31:25 +08:00
|
|
|
/* No page in the page cache at all */
|
|
|
|
count_vm_event(PGMAJFAULT);
|
2017-07-07 06:40:25 +08:00
|
|
|
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
|
2009-06-17 06:31:25 +08:00
|
|
|
ret = VM_FAULT_MAJOR;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
fpin = do_sync_mmap_readahead(vmf);
|
2009-06-17 06:31:25 +08:00
|
|
|
retry_find:
|
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:14 +08:00
|
|
|
page = pagecache_get_page(mapping, offset,
|
|
|
|
FGP_CREAT|FGP_FOR_MMAP,
|
|
|
|
vmf->gfp_mask);
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
if (!page) {
|
|
|
|
if (fpin)
|
|
|
|
goto out_retry;
|
2020-04-02 12:04:53 +08:00
|
|
|
return VM_FAULT_OOM;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
|
|
|
|
goto out_retry;
|
2010-10-27 05:21:56 +08:00
|
|
|
|
|
|
|
/* Did it get truncated? */
|
filemap: check compound_head(page)->mapping in filemap_fault()
Patch series "Enable THP for text section of non-shmem files", v10;
This patchset follows up discussion at LSF/MM 2019. The motivation is to
put text section of an application in THP, and thus reduces iTLB miss rate
and improves performance. Both Facebook and Oracle showed strong
interests to this feature.
To make reviews easier, this set aims a mininal valid product. Current
version of the work does not have any changes to file system specific
code. This comes with some limitations (discussed later).
This set enables an application to "hugify" its text section by simply
running something like:
madvise(0x600000, 0x80000, MADV_HUGEPAGE);
Before this call, the /proc/<pid>/maps looks like:
00400000-074d0000 r-xp 00000000 00:27 2006927 app
After this call, part of the text section is split out and mapped to
THP:
00400000-00425000 r-xp 00000000 00:27 2006927 app
00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP
00e00000-074d0000 r-xp 00a00000 00:27 2006927 app
Limitations:
1. This only works for text section (vma with VM_DENYWRITE).
2. Original limitation #2 is removed in v3.
We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
Once we get better support on the write path, we can remove the config and
enable it by default.
Tested cases:
1. Tested with btrfs and ext4.
2. Tested with real work application (memcache like caching service).
3. Tested with "THP aware uprobe":
https://patchwork.kernel.org/project/linux-mm/list/?series=131339
This patch (of 7):
Currently, filemap_fault() avoids race condition with truncate by checking
page->mapping == mapping. This does not work for compound pages. This
patch let it check compound_head(page)->mapping instead.
Link: http://lkml.kernel.org/r/20190801184244.3169074-2-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 06:37:44 +08:00
|
|
|
if (unlikely(compound_head(page)->mapping != mapping)) {
|
2010-10-27 05:21:56 +08:00
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
goto retry_find;
|
|
|
|
}
|
2019-09-24 06:37:50 +08:00
|
|
|
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
|
2010-10-27 05:21:56 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 16:46:57 +08:00
|
|
|
* We have a locked page in the page cache, now we need to check
|
|
|
|
* that it's up-to-date. If not, it is going to be due to an error.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 16:46:57 +08:00
|
|
|
if (unlikely(!PageUptodate(page)))
|
2005-04-17 06:20:36 +08:00
|
|
|
goto page_not_uptodate;
|
|
|
|
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
/*
|
|
|
|
* We've made it this far and we had to drop our mmap_sem, now is the
|
|
|
|
* time to return to the upper layer and have it re-find the vma and
|
|
|
|
* redo the fault.
|
|
|
|
*/
|
|
|
|
if (fpin) {
|
|
|
|
unlock_page(page);
|
|
|
|
goto out_retry;
|
|
|
|
}
|
|
|
|
|
2009-06-17 06:31:25 +08:00
|
|
|
/*
|
|
|
|
* Found the page and have a reference on it.
|
|
|
|
* We must recheck i_size under page lock.
|
|
|
|
*/
|
mm: tighten up the fault path a little
The round_up() macro generates a couple of unnecessary instructions
in this usage:
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 83 e8 01 sub $0x1,%rax
48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
48db: 48 83 c0 01 add $0x1,%rax
48df: 48 c1 f8 0c sar $0xc,%rax
48e3: 48 39 c3 cmp %rax,%rbx
48e6: 72 2e jb 4916 <filemap_fault+0x96>
If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
then GCC can see through it and remove the mask (because that would be
dead code given the subsequent shift):
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
48d7: 48 c1 e8 0c shr $0xc,%rax
48db: 48 39 c3 cmp %rax,%rbx
48de: 72 2e jb 490e <filemap_fault+0x8e>
But that's problematic because we'd evaluate 'y' twice. Converting
round_up into an inline function prevents it from being used in other
definitions. The easiest thing to do is just change these three usages
of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
heuristic is wrong in this case.
Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:29 +08:00
|
|
|
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
|
|
|
|
if (unlikely(offset >= max_off)) {
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 16:46:57 +08:00
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2007-11-01 00:19:46 +08:00
|
|
|
return VM_FAULT_SIGBUS;
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 16:46:57 +08:00
|
|
|
}
|
|
|
|
|
2007-07-19 16:47:03 +08:00
|
|
|
vmf->page = page;
|
2007-07-19 16:47:05 +08:00
|
|
|
return ret | VM_FAULT_LOCKED;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
page_not_uptodate:
|
|
|
|
/*
|
|
|
|
* Umm, take care of errors if the page isn't up-to-date.
|
|
|
|
* Try to re-read it _once_. We do this synchronously,
|
|
|
|
* because there really aren't any performance issues here
|
|
|
|
* and we need to check for errors.
|
|
|
|
*/
|
|
|
|
ClearPageError(page);
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
2005-12-16 06:28:17 +08:00
|
|
|
error = mapping->a_ops->readpage(file, page);
|
2008-05-15 07:05:37 +08:00
|
|
|
if (!error) {
|
|
|
|
wait_on_page_locked(page);
|
|
|
|
if (!PageUptodate(page))
|
|
|
|
error = -EIO;
|
|
|
|
}
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
if (fpin)
|
|
|
|
goto out_retry;
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 16:46:57 +08:00
|
|
|
|
|
|
|
if (!error || error == AOP_TRUNCATED_PAGE)
|
2005-12-16 06:28:17 +08:00
|
|
|
goto retry_find;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-04-02 12:04:50 +08:00
|
|
|
shrink_readahead_size_eio(ra);
|
2007-07-19 16:47:03 +08:00
|
|
|
return VM_FAULT_SIGBUS;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:22 +08:00
|
|
|
|
|
|
|
out_retry:
|
|
|
|
/*
|
|
|
|
* We dropped the mmap_sem, we need to return to the fault handler to
|
|
|
|
* re-find the vma and come back and find our hopefully still populated
|
|
|
|
* page.
|
|
|
|
*/
|
|
|
|
if (page)
|
|
|
|
put_page(page);
|
|
|
|
if (fpin)
|
|
|
|
fput(fpin);
|
|
|
|
return ret | VM_FAULT_RETRY;
|
2007-07-19 16:46:59 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fault);
|
|
|
|
|
2016-12-15 07:06:58 +08:00
|
|
|
void filemap_map_pages(struct vm_fault *vmf,
|
2016-07-27 06:25:20 +08:00
|
|
|
pgoff_t start_pgoff, pgoff_t end_pgoff)
|
2014-04-08 06:37:19 +08:00
|
|
|
{
|
2016-12-15 07:06:58 +08:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
2014-04-08 06:37:19 +08:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2016-07-27 06:25:20 +08:00
|
|
|
pgoff_t last_pgoff = start_pgoff;
|
mm: tighten up the fault path a little
The round_up() macro generates a couple of unnecessary instructions
in this usage:
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 83 e8 01 sub $0x1,%rax
48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
48db: 48 83 c0 01 add $0x1,%rax
48df: 48 c1 f8 0c sar $0xc,%rax
48e3: 48 39 c3 cmp %rax,%rbx
48e6: 72 2e jb 4916 <filemap_fault+0x96>
If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
then GCC can see through it and remove the mask (because that would be
dead code given the subsequent shift):
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
48d7: 48 c1 e8 0c shr $0xc,%rax
48db: 48 39 c3 cmp %rax,%rbx
48de: 72 2e jb 490e <filemap_fault+0x8e>
But that's problematic because we'd evaluate 'y' twice. Converting
round_up into an inline function prevents it from being used in other
definitions. The easiest thing to do is just change these three usages
of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
heuristic is wrong in this case.
Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:29 +08:00
|
|
|
unsigned long max_idx;
|
2018-05-17 12:08:30 +08:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start_pgoff);
|
2019-09-24 06:34:52 +08:00
|
|
|
struct page *page;
|
2014-04-08 06:37:19 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-05-17 12:08:30 +08:00
|
|
|
xas_for_each(&xas, page, end_pgoff) {
|
|
|
|
if (xas_retry(&xas, page))
|
|
|
|
continue;
|
|
|
|
if (xa_is_value(page))
|
2016-03-18 05:22:03 +08:00
|
|
|
goto next;
|
2014-04-08 06:37:19 +08:00
|
|
|
|
2018-12-28 16:38:36 +08:00
|
|
|
/*
|
|
|
|
* Check for a locked page first, as a speculative
|
|
|
|
* reference may adversely influence page migration.
|
|
|
|
*/
|
2019-09-24 06:34:52 +08:00
|
|
|
if (PageLocked(page))
|
2018-12-28 16:38:36 +08:00
|
|
|
goto next;
|
2019-09-24 06:34:52 +08:00
|
|
|
if (!page_cache_get_speculative(page))
|
2018-05-17 12:08:30 +08:00
|
|
|
goto next;
|
2014-04-08 06:37:19 +08:00
|
|
|
|
2019-09-24 06:34:52 +08:00
|
|
|
/* Has the page moved or been split? */
|
2018-05-17 12:08:30 +08:00
|
|
|
if (unlikely(page != xas_reload(&xas)))
|
|
|
|
goto skip;
|
2019-09-24 06:34:52 +08:00
|
|
|
page = find_subpage(page, xas.xa_index);
|
2014-04-08 06:37:19 +08:00
|
|
|
|
|
|
|
if (!PageUptodate(page) ||
|
|
|
|
PageReadahead(page) ||
|
|
|
|
PageHWPoison(page))
|
|
|
|
goto skip;
|
|
|
|
if (!trylock_page(page))
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
if (page->mapping != mapping || !PageUptodate(page))
|
|
|
|
goto unlock;
|
|
|
|
|
mm: tighten up the fault path a little
The round_up() macro generates a couple of unnecessary instructions
in this usage:
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 83 e8 01 sub $0x1,%rax
48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
48db: 48 83 c0 01 add $0x1,%rax
48df: 48 c1 f8 0c sar $0xc,%rax
48e3: 48 39 c3 cmp %rax,%rbx
48e6: 72 2e jb 4916 <filemap_fault+0x96>
If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
then GCC can see through it and remove the mask (because that would be
dead code given the subsequent shift):
48cd: 49 8b 47 50 mov 0x50(%r15),%rax
48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
48d7: 48 c1 e8 0c shr $0xc,%rax
48db: 48 39 c3 cmp %rax,%rbx
48de: 72 2e jb 490e <filemap_fault+0x8e>
But that's problematic because we'd evaluate 'y' twice. Converting
round_up into an inline function prevents it from being used in other
definitions. The easiest thing to do is just change these three usages
of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
heuristic is wrong in this case.
Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:29 +08:00
|
|
|
max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
|
|
|
|
if (page->index >= max_idx)
|
2014-04-08 06:37:19 +08:00
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
if (file->f_ra.mmap_miss > 0)
|
|
|
|
file->f_ra.mmap_miss--;
|
2016-07-27 06:25:23 +08:00
|
|
|
|
2018-05-17 12:08:30 +08:00
|
|
|
vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
|
2016-12-15 07:06:58 +08:00
|
|
|
if (vmf->pte)
|
2018-05-17 12:08:30 +08:00
|
|
|
vmf->pte += xas.xa_index - last_pgoff;
|
|
|
|
last_pgoff = xas.xa_index;
|
2020-06-04 07:02:04 +08:00
|
|
|
if (alloc_set_pte(vmf, page))
|
2016-07-27 06:25:23 +08:00
|
|
|
goto unlock;
|
2014-04-08 06:37:19 +08:00
|
|
|
unlock_page(page);
|
|
|
|
goto next;
|
|
|
|
unlock:
|
|
|
|
unlock_page(page);
|
|
|
|
skip:
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-04-08 06:37:19 +08:00
|
|
|
next:
|
2016-07-27 06:25:23 +08:00
|
|
|
/* Huge page is mapped? No need to proceed. */
|
2016-12-15 07:06:58 +08:00
|
|
|
if (pmd_trans_huge(*vmf->pmd))
|
2016-07-27 06:25:23 +08:00
|
|
|
break;
|
2014-04-08 06:37:19 +08:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_map_pages);
|
|
|
|
|
2018-06-08 08:08:00 +08:00
|
|
|
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
|
2012-06-12 22:20:29 +08:00
|
|
|
{
|
|
|
|
struct page *page = vmf->page;
|
2017-02-25 06:56:41 +08:00
|
|
|
struct inode *inode = file_inode(vmf->vma->vm_file);
|
2018-06-08 08:08:00 +08:00
|
|
|
vm_fault_t ret = VM_FAULT_LOCKED;
|
2012-06-12 22:20:29 +08:00
|
|
|
|
2012-06-12 22:20:37 +08:00
|
|
|
sb_start_pagefault(inode->i_sb);
|
2017-02-25 06:56:41 +08:00
|
|
|
file_update_time(vmf->vma->vm_file);
|
2012-06-12 22:20:29 +08:00
|
|
|
lock_page(page);
|
|
|
|
if (page->mapping != inode->i_mapping) {
|
|
|
|
unlock_page(page);
|
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-06-12 22:20:37 +08:00
|
|
|
/*
|
|
|
|
* We mark the page dirty already here so that when freeze is in
|
|
|
|
* progress, we are guaranteed that writeback during freezing will
|
|
|
|
* see the dirty page and writeprotect it again.
|
|
|
|
*/
|
|
|
|
set_page_dirty(page);
|
mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22 08:42:51 +08:00
|
|
|
wait_for_stable_page(page);
|
2012-06-12 22:20:29 +08:00
|
|
|
out:
|
2012-06-12 22:20:37 +08:00
|
|
|
sb_end_pagefault(inode->i_sb);
|
2012-06-12 22:20:29 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-09-28 02:29:37 +08:00
|
|
|
const struct vm_operations_struct generic_file_vm_ops = {
|
2007-07-19 16:46:59 +08:00
|
|
|
.fault = filemap_fault,
|
2014-04-08 06:37:19 +08:00
|
|
|
.map_pages = filemap_map_pages,
|
2012-06-12 22:20:29 +08:00
|
|
|
.page_mkwrite = filemap_page_mkwrite,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/* This is used for a general mmap of a disk file */
|
|
|
|
|
|
|
|
int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
|
|
|
if (!mapping->a_ops->readpage)
|
|
|
|
return -ENOEXEC;
|
|
|
|
file_accessed(file);
|
|
|
|
vma->vm_ops = &generic_file_vm_ops;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is for filesystems which do not implement ->writepage.
|
|
|
|
*/
|
|
|
|
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
|
|
|
|
return -EINVAL;
|
|
|
|
return generic_file_mmap(file, vma);
|
|
|
|
}
|
|
|
|
#else
|
2018-10-27 06:04:03 +08:00
|
|
|
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
|
2018-04-14 06:35:27 +08:00
|
|
|
{
|
2018-10-27 06:04:03 +08:00
|
|
|
return VM_FAULT_SIGBUS;
|
2018-04-14 06:35:27 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
int generic_file_readonly_mmap(struct file * file, struct vm_area_struct * vma)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
2018-04-14 06:35:27 +08:00
|
|
|
EXPORT_SYMBOL(filemap_page_mkwrite);
|
2005-04-17 06:20:36 +08:00
|
|
|
EXPORT_SYMBOL(generic_file_mmap);
|
|
|
|
EXPORT_SYMBOL(generic_file_readonly_mmap);
|
|
|
|
|
2014-04-04 05:48:18 +08:00
|
|
|
static struct page *wait_on_page_read(struct page *page)
|
|
|
|
{
|
|
|
|
if (!IS_ERR(page)) {
|
|
|
|
wait_on_page_locked(page);
|
|
|
|
if (!PageUptodate(page)) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2014-04-04 05:48:18 +08:00
|
|
|
page = ERR_PTR(-EIO);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return page;
|
|
|
|
}
|
|
|
|
|
2016-03-16 05:55:36 +08:00
|
|
|
static struct page *do_read_cache_page(struct address_space *mapping,
|
2007-10-16 16:24:37 +08:00
|
|
|
pgoff_t index,
|
2011-07-26 08:12:23 +08:00
|
|
|
int (*filler)(void *, struct page *),
|
2010-01-28 01:20:03 +08:00
|
|
|
void *data,
|
|
|
|
gfp_t gfp)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-10-16 16:24:57 +08:00
|
|
|
struct page *page;
|
2005-04-17 06:20:36 +08:00
|
|
|
int err;
|
|
|
|
repeat:
|
|
|
|
page = find_get_page(mapping, index);
|
|
|
|
if (!page) {
|
2017-11-16 09:38:03 +08:00
|
|
|
page = __page_cache_alloc(gfp);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (!page)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2011-12-22 01:05:48 +08:00
|
|
|
err = add_to_page_cache_lru(page, mapping, index, gfp);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (unlikely(err)) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2007-10-16 16:24:57 +08:00
|
|
|
if (err == -EEXIST)
|
|
|
|
goto repeat;
|
2017-12-04 17:02:00 +08:00
|
|
|
/* Presumably ENOMEM for xarray node */
|
2005-04-17 06:20:36 +08:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
2016-03-16 05:55:36 +08:00
|
|
|
|
|
|
|
filler:
|
2019-07-12 11:55:20 +08:00
|
|
|
if (filler)
|
|
|
|
err = filler(data, page);
|
|
|
|
else
|
|
|
|
err = mapping->a_ops->readpage(data, page);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (err < 0) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2016-03-16 05:55:36 +08:00
|
|
|
return ERR_PTR(err);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2016-03-16 05:55:36 +08:00
|
|
|
page = wait_on_page_read(page);
|
|
|
|
if (IS_ERR(page))
|
|
|
|
return page;
|
|
|
|
goto out;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
if (PageUptodate(page))
|
|
|
|
goto out;
|
|
|
|
|
2016-03-16 05:55:39 +08:00
|
|
|
/*
|
|
|
|
* Page is not up to date and may be locked due one of the following
|
|
|
|
* case a: Page is being filled and the page lock is held
|
|
|
|
* case b: Read/write error clearing the page uptodate status
|
|
|
|
* case c: Truncation in progress (page locked)
|
|
|
|
* case d: Reclaim in progress
|
|
|
|
*
|
|
|
|
* Case a, the page will be up to date when the page is unlocked.
|
|
|
|
* There is no need to serialise on the page lock here as the page
|
|
|
|
* is pinned so the lock gives no additional protection. Even if the
|
|
|
|
* the page is truncated, the data is still valid if PageUptodate as
|
|
|
|
* it's a race vs truncate race.
|
|
|
|
* Case b, the page will not be up to date
|
|
|
|
* Case c, the page may be truncated but in itself, the data may still
|
|
|
|
* be valid after IO completes as it's a read vs truncate race. The
|
|
|
|
* operation must restart if the page is not uptodate on unlock but
|
|
|
|
* otherwise serialising on page lock to stabilise the mapping gives
|
|
|
|
* no additional guarantees to the caller as the page lock is
|
|
|
|
* released before return.
|
|
|
|
* Case d, similar to truncation. If reclaim holds the page lock, it
|
|
|
|
* will be a race with remove_mapping that determines if the mapping
|
|
|
|
* is valid on unlock but otherwise the data is valid and there is
|
|
|
|
* no need to serialise with page lock.
|
|
|
|
*
|
|
|
|
* As the page lock gives no additional guarantee, we optimistically
|
|
|
|
* wait on the page to be unlocked and check if it's up to date and
|
|
|
|
* use the page if it is. Otherwise, the page lock is required to
|
|
|
|
* distinguish between the different cases. The motivation is that we
|
|
|
|
* avoid spurious serialisations and wakeups when multiple processes
|
|
|
|
* wait on the same page for IO to complete.
|
|
|
|
*/
|
|
|
|
wait_on_page_locked(page);
|
|
|
|
if (PageUptodate(page))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Distinguish between all the cases under the safety of the lock */
|
2005-04-17 06:20:36 +08:00
|
|
|
lock_page(page);
|
2016-03-16 05:55:39 +08:00
|
|
|
|
|
|
|
/* Case c or d, restart the operation */
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!page->mapping) {
|
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2016-03-16 05:55:36 +08:00
|
|
|
goto repeat;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2016-03-16 05:55:39 +08:00
|
|
|
|
|
|
|
/* Someone else locked and filled the page in a very small window */
|
2005-04-17 06:20:36 +08:00
|
|
|
if (PageUptodate(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
goto out;
|
|
|
|
}
|
mm/filemap.c: clear page error before actual read
Mount failure issue happens under the scenario: Application forked dozens
of threads to mount the same number of cramfs images separately in docker,
but several mounts failed with high probability. Mount failed due to the
checking result of the page(read from the superblock of loop dev) is not
uptodate after wait_on_page_locked(page) returned in function cramfs_read:
wait_on_page_locked(page);
if (!PageUptodate(page)) {
...
}
The reason of the checking result of the page not uptodate: systemd-udevd
read the loopX dev before mount, because the status of loopX is Lo_unbound
at this time, so loop_make_request directly trigger the calling of io_end
handler end_buffer_async_read, which called SetPageError(page). So It
caused the page can't be set to uptodate in function
end_buffer_async_read:
if(page_uptodate && !PageError(page)) {
SetPageUptodate(page);
}
Then mount operation is performed, it used the same page which is just
accessed by systemd-udevd above, Because this page is not uptodate, it
will launch a actual read via submit_bh, then wait on this page by calling
wait_on_page_locked(page). When the I/O of the page done, io_end handler
end_buffer_async_read is called, because no one cleared the page
error(during the whole read path of mount), which is caused by
systemd-udevd reading, so this page is still in "PageError" status, which
can't be set to uptodate in function end_buffer_async_read, then caused
mount failure.
But sometimes mount succeed even through systemd-udeved read loopX dev
just before, The reason is systemd-udevd launched other loopX read just
between step 3.1 and 3.2, the steps as below:
1, loopX dev default status is Lo_unbound;
2, systemd-udved read loopX dev (page is set to PageError);
3, mount operation
1) set loopX status to Lo_bound;
==>systemd-udevd read loopX dev<==
2) read loopX dev(page has no error)
3) mount succeed
As the loopX dev status is set to Lo_bound after step 3.1, so the other
loopX dev read by systemd-udevd will go through the whole I/O stack, part
of the call trace as below:
SYS_read
vfs_read
do_sync_read
blkdev_aio_read
generic_file_aio_read
do_generic_file_read:
ClearPageError(page);
mapping->a_ops->readpage(filp, page);
here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel,
some function name changed, the call trace as below:
blkdev_read_iter
generic_file_read_iter
generic_file_buffered_read:
/*
* A previous I/O error may have been due to temporary
* failures, eg. mutipath errors.
* Pg_error will be set again if readpage fails.
*/
ClearPageError(page);
/* Start the actual read. The read will unlock the page*/
error=mapping->a_ops->readpage(flip, page);
We can see ClearPageError(page) is called before the actual read,
then the read in step 3.2 succeed.
This patch is to add the calling of ClearPageError just before the actual
read of read path of cramfs mount. Without the patch, the call trace as
below when performing cramfs mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page
do_read_cache_page:
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, the call trace as below when performing mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page:
do_read_cache_page:
ClearPageError(page); <== new add
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, mount operation trigger the calling of
ClearPageError(page) before the actual read, the page has no error if no
additional page error happen when I/O done.
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <yubin@h3c.com>
Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:04:47 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A previous I/O error may have been due to temporary
|
|
|
|
* failures.
|
|
|
|
* Clear page error before actual read, PG_error will be
|
|
|
|
* set again if read page fails.
|
|
|
|
*/
|
|
|
|
ClearPageError(page);
|
2016-03-16 05:55:36 +08:00
|
|
|
goto filler;
|
|
|
|
|
2007-05-09 20:42:20 +08:00
|
|
|
out:
|
2007-05-07 05:49:04 +08:00
|
|
|
mark_page_accessed(page);
|
|
|
|
return page;
|
|
|
|
}
|
2010-01-28 01:20:03 +08:00
|
|
|
|
|
|
|
/**
|
2014-04-04 05:48:18 +08:00
|
|
|
* read_cache_page - read into page cache, fill it if needed
|
2010-01-28 01:20:03 +08:00
|
|
|
* @mapping: the page's address_space
|
|
|
|
* @index: the page index
|
|
|
|
* @filler: function to perform the read
|
2011-07-26 08:12:23 +08:00
|
|
|
* @data: first arg to filler(data, page) function, often left as NULL
|
2010-01-28 01:20:03 +08:00
|
|
|
*
|
|
|
|
* Read into the page cache. If a page already exists, and PageUptodate() is
|
2014-04-04 05:48:18 +08:00
|
|
|
* not set, try to fill the page and wait for it to become unlocked.
|
2010-01-28 01:20:03 +08:00
|
|
|
*
|
|
|
|
* If the page does not get brought uptodate, return -EIO.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: up to date page on success, ERR_PTR() on failure.
|
2010-01-28 01:20:03 +08:00
|
|
|
*/
|
2014-04-04 05:48:18 +08:00
|
|
|
struct page *read_cache_page(struct address_space *mapping,
|
2010-01-28 01:20:03 +08:00
|
|
|
pgoff_t index,
|
2011-07-26 08:12:23 +08:00
|
|
|
int (*filler)(void *, struct page *),
|
2010-01-28 01:20:03 +08:00
|
|
|
void *data)
|
|
|
|
{
|
2019-07-12 11:55:17 +08:00
|
|
|
return do_read_cache_page(mapping, index, filler, data,
|
|
|
|
mapping_gfp_mask(mapping));
|
2010-01-28 01:20:03 +08:00
|
|
|
}
|
2014-04-04 05:48:18 +08:00
|
|
|
EXPORT_SYMBOL(read_cache_page);
|
2010-01-28 01:20:03 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* read_cache_page_gfp - read into page cache, using specified page allocation flags.
|
|
|
|
* @mapping: the page's address_space
|
|
|
|
* @index: the page index
|
|
|
|
* @gfp: the page allocator flags to use if allocating
|
|
|
|
*
|
|
|
|
* This is the same as "read_mapping_page(mapping, index, NULL)", but with
|
2011-12-22 01:05:48 +08:00
|
|
|
* any new page allocations done using the specified allocation flags.
|
2010-01-28 01:20:03 +08:00
|
|
|
*
|
|
|
|
* If the page does not get brought uptodate, return -EIO.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return: up to date page on success, ERR_PTR() on failure.
|
2010-01-28 01:20:03 +08:00
|
|
|
*/
|
|
|
|
struct page *read_cache_page_gfp(struct address_space *mapping,
|
|
|
|
pgoff_t index,
|
|
|
|
gfp_t gfp)
|
|
|
|
{
|
2019-07-12 11:55:20 +08:00
|
|
|
return do_read_cache_page(mapping, index, NULL, NULL, gfp);
|
2010-01-28 01:20:03 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(read_cache_page_gfp);
|
|
|
|
|
2018-10-30 07:40:46 +08:00
|
|
|
/*
|
|
|
|
* Don't operate on ranges the page cache doesn't support, and don't exceed the
|
|
|
|
* LFS limits. If pos is under the limit it becomes a short access. If it
|
|
|
|
* exceeds the limit we return -EFBIG.
|
|
|
|
*/
|
|
|
|
static int generic_write_check_limits(struct file *file, loff_t pos,
|
|
|
|
loff_t *count)
|
|
|
|
{
|
2019-06-05 23:04:48 +08:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
|
|
|
loff_t max_size = inode->i_sb->s_maxbytes;
|
2018-10-30 07:40:46 +08:00
|
|
|
loff_t limit = rlimit(RLIMIT_FSIZE);
|
|
|
|
|
|
|
|
if (limit != RLIM_INFINITY) {
|
|
|
|
if (pos >= limit) {
|
|
|
|
send_sig(SIGXFSZ, current, 0);
|
|
|
|
return -EFBIG;
|
|
|
|
}
|
|
|
|
*count = min(*count, limit - pos);
|
|
|
|
}
|
|
|
|
|
2019-06-05 23:04:48 +08:00
|
|
|
if (!(file->f_flags & O_LARGEFILE))
|
|
|
|
max_size = MAX_NON_LFS;
|
|
|
|
|
|
|
|
if (unlikely(pos >= max_size))
|
|
|
|
return -EFBIG;
|
|
|
|
|
|
|
|
*count = min(*count, max_size - pos);
|
|
|
|
|
|
|
|
return 0;
|
2018-10-30 07:40:46 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Performs necessary checks before doing a write
|
|
|
|
*
|
2006-06-23 17:03:49 +08:00
|
|
|
* Can adjust writing position or amount of bytes to write.
|
2005-04-17 06:20:36 +08:00
|
|
|
* Returns appropriate error code that caller should return or
|
|
|
|
* zero in case that write should be allowed.
|
|
|
|
*/
|
2015-04-10 00:55:47 +08:00
|
|
|
inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2015-04-10 00:55:47 +08:00
|
|
|
struct file *file = iocb->ki_filp;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
2018-10-30 07:40:46 +08:00
|
|
|
loff_t count;
|
|
|
|
int ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2019-08-20 22:55:16 +08:00
|
|
|
if (IS_SWAPFILE(inode))
|
|
|
|
return -ETXTBSY;
|
|
|
|
|
2015-04-10 00:55:47 +08:00
|
|
|
if (!iov_iter_count(from))
|
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2015-04-04 16:05:48 +08:00
|
|
|
/* FIXME: this is for backwards compatibility with 2.4 */
|
2015-04-10 01:52:01 +08:00
|
|
|
if (iocb->ki_flags & IOCB_APPEND)
|
2015-04-10 00:55:47 +08:00
|
|
|
iocb->ki_pos = i_size_read(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-06-20 20:05:44 +08:00
|
|
|
if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2018-10-30 07:40:46 +08:00
|
|
|
count = iov_iter_count(from);
|
|
|
|
ret = generic_write_check_limits(file, iocb->ki_pos, &count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-10-30 07:40:46 +08:00
|
|
|
iov_iter_truncate(from, count);
|
2015-04-10 00:55:47 +08:00
|
|
|
return iov_iter_count(from);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_write_checks);
|
|
|
|
|
2018-10-30 07:40:31 +08:00
|
|
|
/*
|
|
|
|
* Performs necessary checks before doing a clone.
|
|
|
|
*
|
2019-06-05 23:04:48 +08:00
|
|
|
* Can adjust amount of bytes to clone via @req_count argument.
|
2018-10-30 07:40:31 +08:00
|
|
|
* Returns appropriate error code that caller should return or
|
|
|
|
* zero in case the clone should be allowed.
|
|
|
|
*/
|
|
|
|
int generic_remap_checks(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
2018-10-30 07:41:49 +08:00
|
|
|
loff_t *req_count, unsigned int remap_flags)
|
2018-10-30 07:40:31 +08:00
|
|
|
{
|
|
|
|
struct inode *inode_in = file_in->f_mapping->host;
|
|
|
|
struct inode *inode_out = file_out->f_mapping->host;
|
|
|
|
uint64_t count = *req_count;
|
|
|
|
uint64_t bcount;
|
|
|
|
loff_t size_in, size_out;
|
|
|
|
loff_t bs = inode_out->i_sb->s_blocksize;
|
2018-10-30 07:40:46 +08:00
|
|
|
int ret;
|
2018-10-30 07:40:31 +08:00
|
|
|
|
|
|
|
/* The start of both ranges must be aligned to an fs block. */
|
|
|
|
if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* Ensure offsets don't wrap. */
|
|
|
|
if (pos_in + count < pos_in || pos_out + count < pos_out)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
size_in = i_size_read(inode_in);
|
|
|
|
size_out = i_size_read(inode_out);
|
|
|
|
|
|
|
|
/* Dedupe requires both ranges to be within EOF. */
|
2018-10-30 07:41:34 +08:00
|
|
|
if ((remap_flags & REMAP_FILE_DEDUP) &&
|
2018-10-30 07:40:31 +08:00
|
|
|
(pos_in >= size_in || pos_in + count > size_in ||
|
|
|
|
pos_out >= size_out || pos_out + count > size_out))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* Ensure the infile range is within the infile. */
|
|
|
|
if (pos_in >= size_in)
|
|
|
|
return -EINVAL;
|
|
|
|
count = min(count, size_in - (uint64_t)pos_in);
|
|
|
|
|
2018-10-30 07:40:46 +08:00
|
|
|
ret = generic_write_check_limits(file_out, pos_out, &count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
2018-10-30 07:40:31 +08:00
|
|
|
* If the user wanted us to link to the infile's EOF, round up to the
|
|
|
|
* next block boundary for this check.
|
|
|
|
*
|
|
|
|
* Otherwise, make sure the count is also block-aligned, having
|
|
|
|
* already confirmed the starting offsets' block alignment.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2018-10-30 07:40:31 +08:00
|
|
|
if (pos_in + count == size_in) {
|
|
|
|
bcount = ALIGN(size_in, bs) - pos_in;
|
|
|
|
} else {
|
|
|
|
if (!IS_ALIGNED(count, bs))
|
2018-10-30 07:42:10 +08:00
|
|
|
count = ALIGN_DOWN(count, bs);
|
2018-10-30 07:40:31 +08:00
|
|
|
bcount = count;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2018-10-30 07:40:31 +08:00
|
|
|
/* Don't allow overlapped cloning within the same file. */
|
|
|
|
if (inode_in == inode_out &&
|
|
|
|
pos_out + bcount > pos_in &&
|
|
|
|
pos_out < pos_in + bcount)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2018-10-30 07:42:10 +08:00
|
|
|
* We shortened the request but the caller can't deal with that, so
|
|
|
|
* bounce the request back to userspace.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2018-10-30 07:42:10 +08:00
|
|
|
if (*req_count != count && !(remap_flags & REMAP_FILE_CAN_SHORTEN))
|
2018-10-30 07:40:31 +08:00
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-10-30 07:42:10 +08:00
|
|
|
*req_count = count;
|
2018-10-30 07:40:31 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2019-06-05 23:04:48 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Performs common checks before doing a file copy/clone
|
|
|
|
* from @file_in to @file_out.
|
|
|
|
*/
|
|
|
|
int generic_file_rw_checks(struct file *file_in, struct file *file_out)
|
|
|
|
{
|
|
|
|
struct inode *inode_in = file_inode(file_in);
|
|
|
|
struct inode *inode_out = file_inode(file_out);
|
|
|
|
|
|
|
|
/* Don't copy dirs, pipes, sockets... */
|
|
|
|
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
|
|
|
|
return -EISDIR;
|
|
|
|
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (!(file_in->f_mode & FMODE_READ) ||
|
|
|
|
!(file_out->f_mode & FMODE_WRITE) ||
|
|
|
|
(file_out->f_flags & O_APPEND))
|
|
|
|
return -EBADF;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-05 23:04:49 +08:00
|
|
|
/*
|
|
|
|
* Performs necessary checks before doing a file copy
|
|
|
|
*
|
|
|
|
* Can adjust amount of bytes to copy via @req_count argument.
|
|
|
|
* Returns appropriate error code that caller should return or
|
|
|
|
* zero in case the copy should be allowed.
|
|
|
|
*/
|
|
|
|
int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
|
|
|
|
struct file *file_out, loff_t pos_out,
|
|
|
|
size_t *req_count, unsigned int flags)
|
|
|
|
{
|
|
|
|
struct inode *inode_in = file_inode(file_in);
|
|
|
|
struct inode *inode_out = file_inode(file_out);
|
|
|
|
uint64_t count = *req_count;
|
|
|
|
loff_t size_in;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = generic_file_rw_checks(file_in, file_out);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* Don't touch certain kinds of inodes */
|
|
|
|
if (IS_IMMUTABLE(inode_out))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
|
|
|
|
return -ETXTBSY;
|
|
|
|
|
|
|
|
/* Ensure offsets don't wrap. */
|
|
|
|
if (pos_in + count < pos_in || pos_out + count < pos_out)
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
|
|
|
/* Shorten the copy to EOF */
|
|
|
|
size_in = i_size_read(inode_in);
|
|
|
|
if (pos_in >= size_in)
|
|
|
|
count = 0;
|
|
|
|
else
|
|
|
|
count = min(count, size_in - (uint64_t)pos_in);
|
|
|
|
|
|
|
|
ret = generic_write_check_limits(file_out, pos_out, &count);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* Don't allow overlapped copying within the same file. */
|
|
|
|
if (inode_in == inode_out &&
|
|
|
|
pos_out + count > pos_in &&
|
|
|
|
pos_out < pos_in + count)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
*req_count = count;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-10-16 16:25:01 +08:00
|
|
|
int pagecache_write_begin(struct file *file, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned flags,
|
|
|
|
struct page **pagep, void **fsdata)
|
|
|
|
{
|
|
|
|
const struct address_space_operations *aops = mapping->a_ops;
|
|
|
|
|
2008-10-30 05:00:55 +08:00
|
|
|
return aops->write_begin(file, mapping, pos, len, flags,
|
2007-10-16 16:25:01 +08:00
|
|
|
pagep, fsdata);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(pagecache_write_begin);
|
|
|
|
|
|
|
|
int pagecache_write_end(struct file *file, struct address_space *mapping,
|
|
|
|
loff_t pos, unsigned len, unsigned copied,
|
|
|
|
struct page *page, void *fsdata)
|
|
|
|
{
|
|
|
|
const struct address_space_operations *aops = mapping->a_ops;
|
|
|
|
|
2008-10-30 05:00:55 +08:00
|
|
|
return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
|
2007-10-16 16:25:01 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(pagecache_write_end);
|
|
|
|
|
2019-12-01 09:49:44 +08:00
|
|
|
/*
|
|
|
|
* Warn about a page cache invalidation failure during a direct I/O write.
|
|
|
|
*/
|
|
|
|
void dio_warn_stale_pagecache(struct file *filp)
|
|
|
|
{
|
|
|
|
static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
|
|
|
|
char pathname[128];
|
|
|
|
struct inode *inode = file_inode(filp);
|
|
|
|
char *path;
|
|
|
|
|
|
|
|
errseq_set(&inode->i_mapping->wb_err, -EIO);
|
|
|
|
if (__ratelimit(&_rs)) {
|
|
|
|
path = file_path(filp, pathname, sizeof(pathname));
|
|
|
|
if (IS_ERR(path))
|
|
|
|
path = "(unknown)";
|
|
|
|
pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n");
|
|
|
|
pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
|
|
|
|
current->comm);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
ssize_t
|
2016-04-07 23:51:56 +08:00
|
|
|
generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct file *file = iocb->ki_filp;
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
struct inode *inode = mapping->host;
|
2016-04-07 23:51:56 +08:00
|
|
|
loff_t pos = iocb->ki_pos;
|
2005-04-17 06:20:36 +08:00
|
|
|
ssize_t written;
|
2008-07-24 12:27:04 +08:00
|
|
|
size_t write_len;
|
|
|
|
pgoff_t end;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2014-03-22 18:51:37 +08:00
|
|
|
write_len = iov_iter_count(from);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
end = (pos + write_len - 1) >> PAGE_SHIFT;
|
2008-07-24 12:27:04 +08:00
|
|
|
|
2017-06-20 20:05:44 +08:00
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT) {
|
|
|
|
/* If there are pages to writeback, return */
|
|
|
|
if (filemap_range_has_page(inode->i_mapping, pos,
|
2019-03-06 07:44:21 +08:00
|
|
|
pos + write_len - 1))
|
2017-06-20 20:05:44 +08:00
|
|
|
return -EAGAIN;
|
|
|
|
} else {
|
|
|
|
written = filemap_write_and_wait_range(mapping, pos,
|
|
|
|
pos + write_len - 1);
|
|
|
|
if (written)
|
|
|
|
goto out;
|
|
|
|
}
|
2008-07-24 12:27:04 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* After a write we want buffered reads to be sure to go to disk to get
|
|
|
|
* the new data. We invalidate clean cached page from the region we're
|
|
|
|
* about to write. We do this *before* the write so that we can return
|
2008-09-03 05:35:40 +08:00
|
|
|
* without clobbering -EIOCBQUEUED from ->direct_IO().
|
2008-07-24 12:27:04 +08:00
|
|
|
*/
|
fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache. The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.
Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well. So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.
- Patch 1 fixes direct IO writes by removing ->nrpages check.
- Patch 2 fixes similar case in invalidate_bdev().
Note: I only fixed conditional cleancache_invalidate_inode() here.
Do we also need to add ->nrexceptional check in into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call. So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:55:59 +08:00
|
|
|
written = invalidate_inode_pages2_range(mapping,
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
pos >> PAGE_SHIFT, end);
|
fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache. The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.
Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well. So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.
- Patch 1 fixes direct IO writes by removing ->nrpages check.
- Patch 2 fixes similar case in invalidate_bdev().
Note: I only fixed conditional cleancache_invalidate_inode() here.
Do we also need to add ->nrexceptional check in into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call. So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:55:59 +08:00
|
|
|
/*
|
|
|
|
* If a page can not be invalidated, return 0 to fall back
|
|
|
|
* to buffered write.
|
|
|
|
*/
|
|
|
|
if (written) {
|
|
|
|
if (written == -EBUSY)
|
|
|
|
return 0;
|
|
|
|
goto out;
|
2008-07-24 12:27:04 +08:00
|
|
|
}
|
|
|
|
|
2017-04-14 02:10:15 +08:00
|
|
|
written = mapping->a_ops->direct_IO(iocb, from);
|
2008-07-24 12:27:04 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Finally, try again to invalidate clean pages which might have been
|
|
|
|
* cached by non-direct readahead, or faulted in by get_user_pages()
|
|
|
|
* if the source of the write was an mmap'ed region of the file
|
|
|
|
* we're writing. Either one is a pretty crazy thing to do,
|
|
|
|
* so we don't support it 100%. If this invalidation
|
|
|
|
* fails, tough, the write still worked...
|
2017-09-21 22:16:29 +08:00
|
|
|
*
|
|
|
|
* Most of the time we do not need this since dio_complete() will do
|
|
|
|
* the invalidation for us. However there are some file systems that
|
|
|
|
* do not end up with dio_complete() being called, so let's not break
|
2019-12-01 09:49:41 +08:00
|
|
|
* them by removing it completely.
|
|
|
|
*
|
2019-12-01 09:49:47 +08:00
|
|
|
* Noticeable example is a blkdev_direct_IO().
|
|
|
|
*
|
2019-12-01 09:49:41 +08:00
|
|
|
* Skip invalidation for async writes or if mapping has no pages.
|
2008-07-24 12:27:04 +08:00
|
|
|
*/
|
2019-12-01 09:49:47 +08:00
|
|
|
if (written > 0 && mapping->nrpages &&
|
|
|
|
invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end))
|
|
|
|
dio_warn_stale_pagecache(file);
|
2008-07-24 12:27:04 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (written > 0) {
|
2010-10-27 05:21:58 +08:00
|
|
|
pos += written;
|
2017-04-14 02:10:15 +08:00
|
|
|
write_len -= written;
|
2010-10-27 05:21:58 +08:00
|
|
|
if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
|
|
|
|
i_size_write(inode, pos);
|
2005-04-17 06:20:36 +08:00
|
|
|
mark_inode_dirty(inode);
|
|
|
|
}
|
2014-02-12 09:58:20 +08:00
|
|
|
iocb->ki_pos = pos;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2017-04-14 02:10:15 +08:00
|
|
|
iov_iter_revert(from, write_len - iov_iter_count(from));
|
2008-07-24 12:27:04 +08:00
|
|
|
out:
|
2005-04-17 06:20:36 +08:00
|
|
|
return written;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_file_direct_write);
|
|
|
|
|
2007-10-16 16:24:57 +08:00
|
|
|
/*
|
|
|
|
* Find or create a page at the given pagecache position. Return the locked
|
|
|
|
* page. This function is specifically for buffered writes.
|
|
|
|
*/
|
fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 04:00:53 +08:00
|
|
|
struct page *grab_cache_page_write_begin(struct address_space *mapping,
|
|
|
|
pgoff_t index, unsigned flags)
|
2007-10-16 16:24:57 +08:00
|
|
|
{
|
|
|
|
struct page *page;
|
2016-05-21 07:56:28 +08:00
|
|
|
int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
|
2012-01-11 07:07:53 +08:00
|
|
|
|
fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 04:00:53 +08:00
|
|
|
if (flags & AOP_FLAG_NOFS)
|
2014-06-05 07:10:31 +08:00
|
|
|
fgp_flags |= FGP_NOFS;
|
|
|
|
|
|
|
|
page = pagecache_get_page(mapping, index, fgp_flags,
|
2014-12-30 03:30:35 +08:00
|
|
|
mapping_gfp_mask(mapping));
|
2011-01-14 07:46:18 +08:00
|
|
|
if (page)
|
2014-06-05 07:10:31 +08:00
|
|
|
wait_for_stable_page(page);
|
2007-10-16 16:24:57 +08:00
|
|
|
|
|
|
|
return page;
|
|
|
|
}
|
fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 04:00:53 +08:00
|
|
|
EXPORT_SYMBOL(grab_cache_page_write_begin);
|
2007-10-16 16:24:57 +08:00
|
|
|
|
2014-02-12 10:34:08 +08:00
|
|
|
ssize_t generic_perform_write(struct file *file,
|
2007-10-16 16:25:01 +08:00
|
|
|
struct iov_iter *i, loff_t pos)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
const struct address_space_operations *a_ops = mapping->a_ops;
|
|
|
|
long status = 0;
|
|
|
|
ssize_t written = 0;
|
2007-10-16 16:25:03 +08:00
|
|
|
unsigned int flags = 0;
|
|
|
|
|
2007-10-16 16:25:01 +08:00
|
|
|
do {
|
|
|
|
struct page *page;
|
|
|
|
unsigned long offset; /* Offset into pagecache page */
|
|
|
|
unsigned long bytes; /* Bytes to write to page */
|
|
|
|
size_t copied; /* Bytes copied from user */
|
|
|
|
void *fsdata;
|
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
offset = (pos & (PAGE_SIZE - 1));
|
|
|
|
bytes = min_t(unsigned long, PAGE_SIZE - offset,
|
2007-10-16 16:25:01 +08:00
|
|
|
iov_iter_count(i));
|
|
|
|
|
|
|
|
again:
|
2015-10-07 15:32:38 +08:00
|
|
|
/*
|
|
|
|
* Bring in the user page that we will copy from _first_.
|
|
|
|
* Otherwise there's a nasty deadlock on copying from the
|
|
|
|
* same page as we're writing to, without it being marked
|
|
|
|
* up-to-date.
|
|
|
|
*
|
|
|
|
* Not only is this an optimisation, but it is also required
|
|
|
|
* to check that the address is actually valid, when atomic
|
|
|
|
* usercopies are used, below.
|
|
|
|
*/
|
|
|
|
if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
|
|
|
|
status = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
mm: make sendfile(2) killable
Currently a simple program below issues a sendfile(2) system call which
takes about 62 days to complete in my test KVM instance.
int fd;
off_t off = 0;
fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
ftruncate(fd, 2);
lseek(fd, 0, SEEK_END);
sendfile(fd, fd, &off, 0xfffffff);
Now you should not ask kernel to do a stupid stuff like copying 256MB in
2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin
should have a way to stop you.
We actually do have a check for fatal_signal_pending() in
generic_perform_write() which triggers in this path however because we
always succeed in writing something before the check is done, we return
value > 0 from generic_perform_write() and thus the information about
signal gets lost.
Fix the problem by doing the signal check before writing anything. That
way generic_perform_write() returns -EINTR, the error gets propagated up
and the sendfile loop terminates early.
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-23 04:32:21 +08:00
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
status = -EINTR;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2007-10-16 16:25:03 +08:00
|
|
|
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
|
2007-10-16 16:25:01 +08:00
|
|
|
&page, &fsdata);
|
2014-06-05 07:10:31 +08:00
|
|
|
if (unlikely(status < 0))
|
2007-10-16 16:25:01 +08:00
|
|
|
break;
|
|
|
|
|
mm: flush dcache before writing into page to avoid alias
The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic. So the right steps should be:
flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);
More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.
Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:
int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);
The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-03 05:44:02 +08:00
|
|
|
if (mapping_writably_mapped(mapping))
|
|
|
|
flush_dcache_page(page);
|
2015-10-07 15:32:38 +08:00
|
|
|
|
2007-10-16 16:25:01 +08:00
|
|
|
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
|
|
|
|
flush_dcache_page(page);
|
|
|
|
|
|
|
|
status = a_ops->write_end(file, mapping, pos, bytes, copied,
|
|
|
|
page, fsdata);
|
|
|
|
if (unlikely(status < 0))
|
|
|
|
break;
|
|
|
|
copied = status;
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
|
2008-02-02 22:01:17 +08:00
|
|
|
iov_iter_advance(i, copied);
|
2007-10-16 16:25:01 +08:00
|
|
|
if (unlikely(copied == 0)) {
|
|
|
|
/*
|
|
|
|
* If we were unable to copy any data at all, we must
|
|
|
|
* fall back to a single segment length write.
|
|
|
|
*
|
|
|
|
* If we didn't fallback here, we could livelock
|
|
|
|
* because not all segments in the iov can be copied at
|
|
|
|
* once without a pagefault.
|
|
|
|
*/
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
bytes = min_t(unsigned long, PAGE_SIZE - offset,
|
2007-10-16 16:25:01 +08:00
|
|
|
iov_iter_single_seg_count(i));
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
pos += copied;
|
|
|
|
written += copied;
|
|
|
|
|
|
|
|
balance_dirty_pages_ratelimited(mapping);
|
|
|
|
} while (iov_iter_count(i));
|
|
|
|
|
|
|
|
return written ? written : status;
|
|
|
|
}
|
2014-02-12 10:34:08 +08:00
|
|
|
EXPORT_SYMBOL(generic_perform_write);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-08-18 00:10:06 +08:00
|
|
|
/**
|
2014-04-03 15:17:43 +08:00
|
|
|
* __generic_file_write_iter - write data to a file
|
2009-08-18 00:10:06 +08:00
|
|
|
* @iocb: IO state structure (file, offset, etc.)
|
2014-04-03 15:17:43 +08:00
|
|
|
* @from: iov_iter with data to write
|
2009-08-18 00:10:06 +08:00
|
|
|
*
|
|
|
|
* This function does all the work needed for actually writing data to a
|
|
|
|
* file. It does all basic checks, removes SUID from the file, updates
|
|
|
|
* modification times and calls proper subroutines depending on whether we
|
|
|
|
* do direct IO or a standard buffered write.
|
|
|
|
*
|
|
|
|
* It expects i_mutex to be grabbed unless we work on a block device or similar
|
|
|
|
* object which does not need locking at all.
|
|
|
|
*
|
|
|
|
* This function does *not* take care of syncing data in case of O_SYNC write.
|
|
|
|
* A caller has to handle it. This is mainly due to the fact that we want to
|
|
|
|
* avoid syncing under i_mutex.
|
2019-03-06 07:48:42 +08:00
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* * number of bytes written, even for truncated writes
|
|
|
|
* * negative error code if no data has been written at all
|
2009-08-18 00:10:06 +08:00
|
|
|
*/
|
2014-04-03 15:17:43 +08:00
|
|
|
ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct file *file = iocb->ki_filp;
|
2006-10-20 14:28:13 +08:00
|
|
|
struct address_space * mapping = file->f_mapping;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode = mapping->host;
|
2014-02-12 10:34:08 +08:00
|
|
|
ssize_t written = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
ssize_t err;
|
2014-02-12 10:34:08 +08:00
|
|
|
ssize_t status;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* We can write back this queue in page reclaim */
|
2015-01-14 17:42:36 +08:00
|
|
|
current->backing_dev_info = inode_to_bdi(inode);
|
2015-05-21 22:05:53 +08:00
|
|
|
err = file_remove_privs(file);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (err)
|
|
|
|
goto out;
|
|
|
|
|
2012-03-26 21:59:21 +08:00
|
|
|
err = file_update_time(file);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2015-04-10 01:52:01 +08:00
|
|
|
if (iocb->ki_flags & IOCB_DIRECT) {
|
2015-04-07 22:22:53 +08:00
|
|
|
loff_t pos, endbyte;
|
2006-10-20 14:28:13 +08:00
|
|
|
|
2016-04-07 23:51:56 +08:00
|
|
|
written = generic_file_direct_write(iocb, from);
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2015-02-17 07:58:53 +08:00
|
|
|
* If the write stopped short of completing, fall back to
|
|
|
|
* buffered writes. Some filesystems do this for writes to
|
|
|
|
* holes, for example. For DAX files, a buffered write will
|
|
|
|
* not succeed (even if it did, DAX does not handle dirty
|
|
|
|
* page-cache pages correctly).
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2015-04-07 22:22:53 +08:00
|
|
|
if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
|
2015-02-17 07:58:53 +08:00
|
|
|
goto out;
|
|
|
|
|
2015-04-07 22:22:53 +08:00
|
|
|
status = generic_perform_write(file, from, pos = iocb->ki_pos);
|
2006-10-20 14:28:13 +08:00
|
|
|
/*
|
2014-02-12 10:34:08 +08:00
|
|
|
* If generic_perform_write() returned a synchronous error
|
2006-10-20 14:28:13 +08:00
|
|
|
* then we want to return the number of bytes which were
|
|
|
|
* direct-written, or the error code if that was zero. Note
|
|
|
|
* that this differs from normal direct-io semantics, which
|
|
|
|
* will return -EFOO even if some bytes were written.
|
|
|
|
*/
|
2014-08-09 00:39:16 +08:00
|
|
|
if (unlikely(status < 0)) {
|
2014-02-12 10:34:08 +08:00
|
|
|
err = status;
|
2006-10-20 14:28:13 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We need to ensure that the page cache pages are written to
|
|
|
|
* disk and invalidated to preserve the expected O_DIRECT
|
|
|
|
* semantics.
|
|
|
|
*/
|
2014-02-12 10:34:08 +08:00
|
|
|
endbyte = pos + status - 1;
|
2015-04-07 22:22:53 +08:00
|
|
|
err = filemap_write_and_wait_range(mapping, pos, endbyte);
|
2006-10-20 14:28:13 +08:00
|
|
|
if (err == 0) {
|
2015-04-07 22:22:53 +08:00
|
|
|
iocb->ki_pos = endbyte + 1;
|
2014-02-12 10:34:08 +08:00
|
|
|
written += status;
|
2006-10-20 14:28:13 +08:00
|
|
|
invalidate_mapping_pages(mapping,
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
pos >> PAGE_SHIFT,
|
|
|
|
endbyte >> PAGE_SHIFT);
|
2006-10-20 14:28:13 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We don't know how much we wrote, so just return
|
|
|
|
* the number of bytes which were direct-written
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
} else {
|
2015-04-07 22:22:53 +08:00
|
|
|
written = generic_perform_write(file, from, iocb->ki_pos);
|
|
|
|
if (likely(written > 0))
|
|
|
|
iocb->ki_pos += written;
|
2006-10-20 14:28:13 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
current->backing_dev_info = NULL;
|
|
|
|
return written ? written : err;
|
|
|
|
}
|
2014-04-03 15:17:43 +08:00
|
|
|
EXPORT_SYMBOL(__generic_file_write_iter);
|
2009-08-18 00:10:06 +08:00
|
|
|
|
|
|
|
/**
|
2014-04-03 15:17:43 +08:00
|
|
|
* generic_file_write_iter - write data to a file
|
2009-08-18 00:10:06 +08:00
|
|
|
* @iocb: IO state structure
|
2014-04-03 15:17:43 +08:00
|
|
|
* @from: iov_iter with data to write
|
2009-08-18 00:10:06 +08:00
|
|
|
*
|
2014-04-03 15:17:43 +08:00
|
|
|
* This is a wrapper around __generic_file_write_iter() to be used by most
|
2009-08-18 00:10:06 +08:00
|
|
|
* filesystems. It takes care of syncing the file in case of O_SYNC file
|
|
|
|
* and acquires i_mutex as needed.
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return:
|
|
|
|
* * negative error code if no data has been written at all of
|
|
|
|
* vfs_fsync_range() failed for a synchronous write
|
|
|
|
* * number of bytes written, even for truncated writes
|
2009-08-18 00:10:06 +08:00
|
|
|
*/
|
2014-04-03 15:17:43 +08:00
|
|
|
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct file *file = iocb->ki_filp;
|
2009-08-18 01:52:36 +08:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
2005-04-17 06:20:36 +08:00
|
|
|
ssize_t ret;
|
|
|
|
|
2016-01-23 04:40:57 +08:00
|
|
|
inode_lock(inode);
|
2015-04-10 00:55:47 +08:00
|
|
|
ret = generic_write_checks(iocb, from);
|
|
|
|
if (ret > 0)
|
2015-04-07 23:28:12 +08:00
|
|
|
ret = __generic_file_write_iter(iocb, from);
|
2016-01-23 04:40:57 +08:00
|
|
|
inode_unlock(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-04-07 23:52:01 +08:00
|
|
|
if (ret > 0)
|
|
|
|
ret = generic_write_sync(iocb, ret);
|
2005-04-17 06:20:36 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2014-04-03 15:17:43 +08:00
|
|
|
EXPORT_SYMBOL(generic_file_write_iter);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-08-30 02:05:54 +08:00
|
|
|
/**
|
|
|
|
* try_to_release_page() - release old fs-specific metadata on a page
|
|
|
|
*
|
|
|
|
* @page: the page which the kernel is trying to free
|
|
|
|
* @gfp_mask: memory allocation flags (and I/O mode)
|
|
|
|
*
|
|
|
|
* The address_space is to try to release any data against the page
|
2019-03-06 07:48:42 +08:00
|
|
|
* (presumably at page->private).
|
2006-08-30 02:05:54 +08:00
|
|
|
*
|
2009-04-03 23:42:36 +08:00
|
|
|
* This may also be called if PG_fscache is set on a page, indicating that the
|
|
|
|
* page is known to the local caching routines.
|
|
|
|
*
|
2006-08-30 02:05:54 +08:00
|
|
|
* The @gfp_mask argument specifies whether I/O may be performed to release
|
2015-11-07 08:28:28 +08:00
|
|
|
* this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
|
2006-08-30 02:05:54 +08:00
|
|
|
*
|
2019-03-06 07:48:42 +08:00
|
|
|
* Return: %1 if the release was successful, otherwise return zero.
|
2006-08-30 02:05:54 +08:00
|
|
|
*/
|
|
|
|
int try_to_release_page(struct page *page, gfp_t gfp_mask)
|
|
|
|
{
|
|
|
|
struct address_space * const mapping = page->mapping;
|
|
|
|
|
|
|
|
BUG_ON(!PageLocked(page));
|
|
|
|
if (PageWriteback(page))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (mapping && mapping->a_ops->releasepage)
|
|
|
|
return mapping->a_ops->releasepage(page, gfp_mask);
|
|
|
|
return try_to_free_buffers(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
EXPORT_SYMBOL(try_to_release_page);
|