linux/fs
Peter Xu 2bad466cc9 mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

The new feature bit makes anonymous memory acts the same as file memory on
userfaultfd-wp in that it'll also wr-protect none ptes.

It can be useful in two cases:

(1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
    so pre-fault can be replaced by enabling this flag and speed up
    protections

(2) It helps to implement async uffd-wp mode that Muhammad is working on [1]

It's debatable whether this is the most ideal solution because with the
new feature bit set, wr-protect none pte needs to pre-populate the
pgtables to the last level (PAGE_SIZE).  But it seems fine so far to
service either purpose above, so we can leave optimizations for later.

The series brings pte markers to anonymous memory too.  There's some
change in the common mm code path in the 1st patch, great to have some eye
looking at it, but hopefully they're still relatively straightforward.


This patch (of 2):

This is a new feature that controls how uffd-wp handles none ptes.  When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not. 
For anonymous it was not as persistent because we used to assume that we
don't need protections on none ptes or known zero pages.

One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being
cleared).

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath.  It doesn't have any impact on file memories so far
because we already have pte markers taking care of that.  So it only
affects anonymous.

The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will
contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte
markers to anonymous), but also on creating the pgtables to store the pte
markers.  So there's potentially less chance of using thp on the first
fault for a none pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.

Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults.  But
they should be straightforward.

With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all
the memory before wr-protect during taking a live snapshot.  Quotting from
Muhammad's test result here [3] based on a simple program [4]:

  (1) With huge page disabled
  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 1111453 (pre-fault 1101011)
  Test MADVISE: 278276 (pre-fault 266378)
  Test WP-UNPOPULATE: 11712

  (2) With Huge page enabled
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  ./uffd_wp_perf
  Test DEFAULT: 4
  Test PRE-READ: 22521 (pre-fault 22348)
  Test MADVISE: 4909 (pre-fault 4743)
  Test WP-UNPOPULATE: 14448

There'll be a great perf boost for no-thp case, while for thp enabled with
extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE,
but that's low possibility in reality, also the overhead was not reduced
but postponed until a follow up write on any huge zero thp, so potentially
it is faster by making the follow up writes slower.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
[3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
[4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

[peterx@redhat.com: comment changes, oneliner fix to khugepaged]
  Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:44 -07:00
..
9p 9p patches for 6.3 merge window (part 1) 2023-03-01 08:52:49 -08:00
adfs fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
affs for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
afs mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
autofs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
befs
bfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
btrfs for-6.3-rc3-tag 2023-03-24 08:32:10 -07:00
cachefiles fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ceph Two small fixes from Xiubo and myself, marked for stable. 2023-03-02 10:48:30 -08:00
cifs smb3: fix unusable share after force unmount failure 2023-03-24 14:37:12 -05:00
coda hardening updates for v6.3-rc1 2023-02-21 11:07:23 -08:00
configfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-02 21:54:23 -08:00
crypto fscrypt: check for NULL keyring in fscrypt_put_master_key_activeref() 2023-03-18 21:08:03 -07:00
debugfs ARM: SoC drivers for 6.3 2023-02-27 10:04:49 -08:00
devpts
dlm Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ecryptfs This update includes the following changes: 2023-02-21 18:10:50 -08:00
efivarfs A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
efs
erofs erofs: use wrapper i_blocksize() in erofs_file_read_iter() 2023-03-09 23:36:04 +08:00
exfat Description for this pull request: 2023-03-01 08:42:27 -08:00
exportfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ext2 for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ext4 mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
f2fs f2fs-for-6.3-rc1 2023-02-27 16:18:51 -08:00
fat There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse fuse update for 6.3 2023-02-27 09:53:58 -08:00
gfs2 Reinstate "GFS2: free disk inode which is deleted by remote node -V2" 2023-03-23 19:37:56 +01:00
hfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
hfsplus fs: hfsplus: fix UAF issue in hfsplus_put_super 2023-03-02 21:54:23 -08:00
hostfs This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
hpfs fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
hugetlbfs mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
iomap mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 Bug fixes and regressions for ext4, the most serious of which is a 2023-03-12 08:55:55 -07:00
jffs2 This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
jfs mm,jfs: move write_one_page/folio_write_one to jfs 2023-03-28 16:20:14 -07:00
kernfs Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ksmbd ksmbd: return unsupported error on smb1 mount 2023-03-24 00:23:13 -05:00
lockd lockd: set file_lock start and end when decoding nlm4 testargs 2023-03-14 14:00:55 -04:00
minix Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:01:15 -08:00
netfs mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
nfs mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
nfs_common filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
nfsd nfsd-6.3 fixes: 2023-03-21 14:48:38 -07:00
nilfs2 mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
nls
notify RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ntfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
ntfs3 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
ocfs2 ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page 2023-03-28 16:20:14 -07:00
omfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
openpromfs
orangefs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
overlayfs fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
proc capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
pstore pstore updates for v6.2-rc1-fixes 2022-12-23 11:55:54 -08:00
qnx4
qnx6
quota RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ramfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
reiserfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
romfs mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() 2023-01-18 17:12:56 -08:00
smbfs_common smb3: Replace smb2pdu 1-element arrays with flex-arrays 2023-02-20 17:25:43 -06:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-03 17:52:25 -08:00
sysfs
sysv Merge branch 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:03:26 -08:00
tracefs fs: port ->mkdir() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
ubifs This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
udf udf: Warn if block mapping is done for in-ICB files 2023-03-06 16:38:25 +01:00
ufs ufs: don't flush page immediately for DIRSYNC directories 2023-03-28 16:20:14 -07:00
unicode
vboxsf fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
verity fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY 2023-03-15 22:50:41 -07:00
xfs xfs: fix mismerged tracepoints 2023-03-24 13:16:01 -07:00
zonefs zonefs: Fix error message in zonefs_file_dio_append() 2023-03-21 06:36:43 +09:00
aio.c Merge branch 'mm-hotfixes-stable' into mm-stable 2023-02-10 15:34:48 -08:00
anon_inodes.c
attr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
bad_inode.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
binfmt_elf_fdpic.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_elf_test.c
binfmt_elf.c Linux 6.2-rc6 2023-01-31 15:01:20 +01:00
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
d_path.c
dax.c fsdax: dedupe should compare the min of two iters' length 2023-03-28 15:24:32 -07:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c fs: move sb_init_dio_done_wq out of direct-io.c 2023-01-26 10:30:56 -07:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c lazy tlb: introduce lazy tlb mm refcount helper functions 2023-03-28 16:20:08 -07:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c
file_table.c filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
file.c fs: prevent out-of-bounds array speculation when closing a file descriptor 2023-03-09 22:46:21 -05:00
filesystems.c
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() 2023-02-02 22:33:19 -08:00
fsopen.c
init.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
inode.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
internal.h for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ioctl.c fs: port inode_owner_or_capable() to mnt_idmap 2023-01-19 09:24:29 +01:00
Kconfig fs: build the legacy direct I/O code conditionally 2023-01-26 10:30:56 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c
libfs.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
locks.c filelocks: use mount idmapping for setlease permission check 2023-03-09 22:36:12 +01:00
Makefile for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mnt_idmapping.c fs: move mnt_idmap 2023-01-19 09:24:30 +01:00
mount.h
mpage.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
namei.c NFSD 6.3 Release Notes 2023-02-22 14:21:40 -08:00
namespace.c Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:20:07 -08:00
no-block.c
nsfs.c nsfs: repair kernel-doc for ns_match() 2023-01-11 15:47:40 -05:00
open.c vfs: avoid duplicating creds in faccessat if possible 2023-02-27 16:39:19 -08:00
pipe.c
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.acl.v6.3 2023-02-20 12:14:33 -08:00
proc_namespace.c
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c
remap_range.c fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap 2023-01-19 09:24:29 +01:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c splice: Remove redundant assignment to ret 2023-03-09 10:10:31 +01:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c
super.c mm: shrinkers: convert shrinker_rwsem to mutex 2023-03-28 16:20:17 -07:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm/uffd: UFFD_FEATURE_WP_UNPOPULATED 2023-04-05 19:42:44 -07:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00