linux/fs
Johannes Weiner 7b785645e8 mm: fix page cache convergence regression
Since a283348629 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-05-31 13:52:41 -04:00
..
9p treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
adfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
affs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
afs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
autofs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 83 2019-05-24 17:37:52 +02:00
befs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
bfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
btrfs for-5.2-rc2-tag 2019-05-30 20:52:40 -07:00
cachefiles treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
ceph treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
cifs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 61 2019-05-24 17:36:45 +02:00
coda treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
configfs configfs: Fix use-after-free when accessing sd->s_dentry 2019-05-28 08:11:58 +02:00
cramfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
crypto treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
debugfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
devpts treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 83 2019-05-24 17:37:52 +02:00
dlm treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ecryptfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
efivarfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
efs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
exportfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ext2 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ext4 Bug fixes (including a regression fix) for ext4. 2019-05-25 15:03:12 -07:00
f2fs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
fat treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
freevxfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
fscache treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
fuse treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
gfs2 Fix a gfs2 sign extension bug introduced in v4.3. 2019-05-22 08:31:09 -07:00
hfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
hfsplus treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
hostfs This pull request contains the following changes for UML: 2019-05-12 17:52:13 -04:00
hpfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
hugetlbfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
isofs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
jbd2 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
jffs2 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
jfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
kernfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
lockd treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
minix treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
nfs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
nfs_common treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
nfsd treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 1 2019-05-21 11:28:39 +02:00
nilfs2 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
nls treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
notify treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118 2019-05-24 17:39:02 +02:00
ntfs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 97 2019-05-24 17:37:53 +02:00
ocfs2 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
omfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
openpromfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
orangefs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
overlayfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
proc treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
pstore treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
qnx4 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
qnx6 treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
quota treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ramfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
reiserfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
romfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
squashfs treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118 2019-05-24 17:39:02 +02:00
sysfs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
sysv treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
tracefs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ubifs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
udf treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
ufs treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
unicode treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
xfs Fixes for 5.1: 2019-05-23 11:18:18 -07:00
aio.c aio: use kmem_cache_free() instead of kfree() 2019-04-04 20:13:59 -04:00
anon_inodes.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
attr.c
bad_inode.c
binfmt_aout.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_elf_fdpic.c
binfmt_elf.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_em86.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_flat.c
binfmt_misc.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_script.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
block_dev.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
buffer.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
char_dev.c chardev: update comment based on the code 2019-04-02 17:49:58 +02:00
compat_binfmt_elf.c
compat_ioctl.c
compat.c
coredump.c
d_path.c
dax.c mm: page_mkclean vs MADV_DONTNEED race 2019-05-14 09:47:48 -07:00
dcache.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
dcookies.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
direct-io.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
drop_caches.c fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() 2019-02-01 15:46:24 -08:00
eventfd.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
eventpoll.c epoll: use rwlock in order to reduce ep_poll_callback() contention 2019-03-07 18:32:01 -08:00
exec.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
fcntl.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
fhandle.c
file_table.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
file.c io_uring-2019-03-06 2019-03-08 14:48:40 -08:00
filesystems.c vfs: Implement a filesystem superblock creation/configuration context 2019-02-28 03:29:26 -05:00
fs_context.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
fs_parser.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
fs_pin.c
fs_struct.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
fs_types.c fs: common implementation of file type 2019-01-21 17:48:13 +01:00
fs-writeback.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
fsopen.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
inode.c mm: fix page cache convergence regression 2019-05-31 13:52:41 -04:00
internal.h Merge branch 'work.mount-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:17:51 -07:00
io_uring.c for-linus-20190516 2019-05-16 19:10:37 -07:00
ioctl.c
iomap.c for-5.2/block-20190507 2019-05-07 18:14:36 -07:00
Kconfig treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
Kconfig.binfmt treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
libfs.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
locks.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
Makefile Add as a feature case-insensitive directories (the casefold feature) 2019-05-07 21:12:44 -07:00
mbcache.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
mount.h saner handling of temporary namespaces 2019-01-30 17:44:07 -05:00
mpage.c block: remove the i argument to bio_for_each_segment_all 2019-04-30 09:26:13 -06:00
namei.c Clean up fscrypt's dcache revalidation support, and other 2019-05-07 21:28:04 -07:00
namespace.c do_move_mount(): fix an unsafe use of is_anon_ns() 2019-05-09 02:32:50 -04:00
no-block.c
nsfs.c nsfs: unobfuscate 2019-04-09 19:20:57 -04:00
open.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
pipe.c Merge branch 'page-refs' (page ref overflow) 2019-04-14 15:09:40 -07:00
pnode.c separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
pnode.h separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
posix_acl.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
proc_namespace.c
read_write.c vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files 2019-05-06 17:46:52 +03:00
readdir.c
select.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
seq_file.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
signalfd.c fs: mark expected switch fall-throughs 2019-04-08 18:21:02 -05:00
splice.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
stack.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
stat.c fs: move generic stat response attr handling to vfs_getattr_nosec 2019-02-01 01:55:45 -05:00
statfs.c vfs: add vfs_get_fsid() helper 2019-02-07 16:38:35 +01:00
super.c [fix] get rid of checking for absent device name in vfs_get_tree() 2019-04-28 21:34:21 -04:00
sync.c fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback 2019-05-14 09:47:50 -07:00
timerfd.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
userfaultfd.c userfaultfd/sysctl: add vm.unprivileged_userfaultfd 2019-05-14 09:47:45 -07:00
utimes.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
xattr.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00