linux/include/trace/events
Axel Rasmussen 7677f7fd8b userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
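
The definition above can be sketched as a small userspace model. This is
not kernel code: the enum, the function name, and the two boolean inputs
are hypothetical stand-ins, where `pte_is_none` plays the role of
huge_pte_none() being true and `page_in_cache` plays the role of
find_lock_page() finding a page.

```c
#include <stdbool.h>

/* Hypothetical model of the fault kinds discussed above; not kernel code. */
enum fault_kind {
    FAULT_NOT_UFFD, /* mapping already present: neither missing nor minor */
    FAULT_MISSING,  /* no PTE, and no page exists yet */
    FAULT_MINOR,    /* no PTE, but the page already exists (filled elsewhere) */
};

/*
 * pte_is_none:   stands in for huge_pte_none() being true.
 * page_in_cache: stands in for find_lock_page() finding an existing page.
 */
static enum fault_kind classify_fault(bool pte_is_none, bool page_in_cache)
{
    if (!pte_is_none)
        return FAULT_NOT_UFFD;
    return page_in_cache ? FAULT_MINOR : FAULT_MISSING;
}
```

In this model, `classify_fault(true, true)` is the "minor" case: the
UFFD mapping has no PTE yet, but the page was already populated through
the other mapping.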

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
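
As a rough illustration, issuing the ioctl from userspace might look like
the sketch below. It assumes uapi headers that already define
UFFDIO_CONTINUE and struct uffdio_continue (mainline Linux 5.13 or
later); the wrapper name `uffd_continue` is hypothetical, and `uffd` is
assumed to be a userfaultfd file descriptor with the range registered in
minor mode.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: tell the kernel that the contents of
 * [start, start + len) are correct and the mapping can be installed.
 * Returns 0 on success or a negative errno value.
 */
static int uffd_continue(int uffd, uint64_t start, uint64_t len)
{
    struct uffdio_continue cont = {
        .range = { .start = start, .len = len },
        .mode = 0, /* no optional mode flags requested */
    };

    if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
        return -errno;
    return 0;
}
```

A fault handler would call this after it has either verified the page or
updated it through the non-UFFD mapping.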

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
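
Step 4 can be sketched as a userspace model of the handler's decision.
Everything here is hypothetical: `struct migration_state`, its fields,
and `prepare_page()` are illustration only, and a local buffer stands in
for fetching the page from the source machine. The point is that the
copy goes through the non-UFFD alias mapping, and that UFFDIO_CONTINUE is
issued afterwards whether or not a copy happened.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping for the target side of a live migration. */
struct migration_state {
    unsigned char *alias;        /* non-UFFD mapping of guest memory */
    const unsigned char *source; /* stand-in for data fetched from the source */
    bool *stale;                 /* stale[i]: page i is out of date on the target */
    size_t page_size;
    size_t npages;
};

/*
 * Bring page idx up to date via the alias mapping if needed.
 * Returns 1 if a copy was performed, 0 if the page was already current,
 * and -1 if idx is out of range. On 0 or 1 the caller would then issue
 * UFFDIO_CONTINUE for the faulting range.
 */
static int prepare_page(struct migration_state *st, size_t idx)
{
    if (idx >= st->npages)
        return -1;
    if (!st->stale[idx])
        return 0;

    memcpy(st->alias + idx * st->page_size,
           st->source + idx * st->page_size,
           st->page_size);
    st->stale[idx] = false;
    return 1;
}
```

The same function could be driven by a background thread walking the
stale map, which matches the note above that not all updates have to be
done on demand.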

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).
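
The error codes listed above can be restated as a lookup function. This
is only a summary of the text, not the kernel's implementation; the
enums and `mismatch_errno()` are hypothetical names, and
UFFDIO_WRITEPROTECT is left out because its result does not depend on
the kind of fault.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the ioctls and memory types discussed above. */
enum uffd_op { OP_CONTINUE, OP_COPY, OP_ZEROPAGE };
enum mem_kind { MEM_PRIVATE, MEM_SHMEM, MEM_HUGETLB };

/*
 * Returns the negative errno the text above describes for using `op`
 * on the given kind of fault, or 0 when the text lists no error.
 * The model does not check whether the combination itself is possible.
 */
static int mismatch_errno(enum uffd_op op, bool fault_is_minor,
                          enum mem_kind mem)
{
    switch (op) {
    case OP_CONTINUE:
        /* UFFDIO_CONTINUE does not allocate, so it needs a minor fault. */
        if (fault_is_minor)
            return 0;
        return mem == MEM_HUGETLB ? -EFAULT : -EINVAL;
    case OP_COPY:
        /* UFFDIO_COPY assumes a new page must be allocated. */
        return fault_is_minor ? -EEXIST : 0;
    case OP_ZEROPAGE:
        /* UFFDIO_ZEROPAGE is unsupported on hugetlb in any case. */
        if (mem == MEM_HUGETLB)
            return -EINVAL;
        return fault_is_minor ? -EEXIST : 0;
    }
    return 0;
}
```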

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this way requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
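
The registration rule above can be modeled as follows. This is a sketch,
not the kernel's check: the `MODE_*` constants and
`check_register_mode()` are hypothetical local names, and
`high_vma_flags` stands in for whether the architecture provides
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical local names for the registration modes. */
#define MODE_MISSING (1u << 0)
#define MODE_WP      (1u << 1)
#define MODE_MINOR   (1u << 2)

/*
 * Minor mode needs its own VM_* flag, which only exists when high VMA
 * flag bits are available. Returns 0 if the requested mode passes this
 * one rule, or -EINVAL if minor mode is requested without them.
 */
static int check_register_mode(unsigned int mode, bool high_vma_flags)
{
    if ((mode & MODE_MINOR) && !high_vma_flags)
        return -EINVAL;
    return 0;
}
```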

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
9p.h
afs.h afs: Use ITER_XARRAY for writing 2021-04-23 10:17:27 +01:00
alarmtimer.h
asoc.h ASoC: soc-core: tidyup jack.h 2020-11-30 12:54:01 +00:00
avc.h selinux: add basic filtering for audit trace events 2020-08-21 17:07:29 -04:00
bcache.h block: remove superfluous param in blk_fill_rwbs() 2021-02-22 06:37:41 -07:00
block.h blktrace: fix blk_rq_merge documentation 2021-02-22 06:37:41 -07:00
bpf_test_run.h selftests: bpf: test writable buffers in raw tps 2019-04-26 19:04:19 -07:00
bridge.h net: bridge: fdb: br_fdb_update can take flags directly 2019-11-01 10:32:43 -07:00
btrfs.h btrfs: zoned: automatically reclaim zones 2021-04-20 20:46:31 +02:00
cachefiles.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
cgroup.h cgroup: use cgrp->kn->id as the cgroup ID 2019-11-12 08:18:04 -08:00
clk.h clk: Trace clk_set_rate() "range" functions 2020-12-17 01:54:31 -08:00
cma.h
compaction.h mm/page_alloc: integrate classzone_idx and high_zoneidx 2020-06-03 20:09:44 -07:00
context_tracking.h
cpuhp.h treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
devfreq.h PM / devfreq: Add tracepoint for frequency changes 2020-10-26 10:52:37 +09:00
devlink.h devlink: Add a tracepoint for trap reports 2020-09-30 18:01:26 -07:00
dma_fence.h tracing: Fix header include guards in trace event headers 2019-07-30 21:49:06 -04:00
erofs.h erofs: convert uncompressed files from readpages to readahead 2020-06-02 10:59:07 -07:00
error_report.h tracing: add error_report_end trace point 2021-02-26 09:41:02 -08:00
ext4.h ext4: disable fast commit with data journalling 2020-11-06 23:01:05 -05:00
f2fs.h f2fs: move ioctl interface definitions to separated file 2020-11-02 08:33:02 -08:00
fib6.h ipv6: Add fib6_type and fib6_flags to fib6_result 2019-04-17 23:11:30 -07:00
fib.h net: Replace nhc_has_gw with nhc_gw_family 2019-04-08 15:22:40 -07:00
filelock.h locks: Remove extra "0x" in tracepoint format specifier 2020-09-01 18:09:34 -04:00
filemap.h ftrace: Rework event_create_dir() 2019-11-27 07:44:25 +01:00
fs_dax.h
fscache.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
fsi_master_aspeed.h fsi: aspeed: Add trace points 2019-11-08 11:28:20 +01:00
fsi_master_ast_cf.h
fsi_master_gpio.h
fsi.h trace: fsi: Print transfer size unsigned 2019-11-08 11:23:37 +01:00
gpio.h tracing: stop making gpio tracing configurable 2019-04-08 15:11:48 +02:00
gpu_mem.h gpu/trace: Minor comment updates for gpu_mem_total tracepoint 2020-05-07 13:32:57 -04:00
host1x.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 1 2019-05-21 11:28:39 +02:00
huge_memory.h khugepaged: introduce 'max_ptes_shared' tunable 2020-06-03 20:09:46 -07:00
hwmon.h
i2c.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
ib_mad.h IB/MAD: Add SMP details to MAD tracing 2019-03-27 15:52:01 -03:00
ib_umad.h IB/UMAD: Add umad trace points 2019-03-27 15:52:01 -03:00
initcall.h
intel_iommu.h iommu/vt-d: Fix compile error [-Werror=implicit-function-declaration] 2021-02-02 14:43:56 +01:00
intel_ish.h
intel-sst.h
io_uring.h io_uring: include cflags in completion trace event 2021-04-11 17:41:59 -06:00
iocost.h blk-iocost: Add iocg idle state tracepoint 2020-12-17 07:55:44 -07:00
iommu.h
ipi.h
irq_matrix.h
irq.h
iscsi.h
jbd2.h jbd2: Provide trace event for handle restarts 2019-11-05 16:00:49 -05:00
kmem.h mm, tracing: improve rss_stat tracepoint message 2021-04-30 11:20:39 -07:00
kvm.h KVM: X86: Implement ring-based dirty memory tracking 2020-11-15 09:49:15 -05:00
kyber.h block: add queue_to_disk() to get gendisk from request_queue 2021-04-12 06:51:57 -06:00
libata.h
lock.h
mce.h
mdio.h
migrate.h mm/vmstat: add events for THP migration without split 2020-08-12 10:57:57 -07:00
mlxsw.h mlxsw: spectrum_acl: Rename rehash_dis trace 2019-03-31 11:01:23 -07:00
mmap_lock.h mm: mmap_lock: add tracepoints around lock acquisition 2020-12-15 12:13:41 -08:00
mmap.h mm: mmap: add trace point of vm_unmapped_area 2020-04-02 09:35:30 -07:00
mmc.h
mmflags.h userfaultfd: add minor fault registration mode 2021-05-05 11:27:22 -07:00
module.h
mptcp.h mptcp: add tracepoint in subflow_check_data_avail 2021-04-16 17:10:40 -07:00
napi.h tracing: Fix header include guards in trace event headers 2019-07-30 21:49:06 -04:00
nbd.h nbd: add tracepoints for send/receive timing 2019-04-26 19:04:19 -07:00
neigh.h neighbor: Add tracepoint to __neigh_create 2019-05-22 17:50:24 -07:00
net_probe_common.h
net.h net: add a generic tracepoint for TX queue timeout 2019-05-04 00:41:41 -04:00
netfs.h netfs: Add a tracepoint to log failures that would be otherwise unseen 2021-04-23 10:14:32 +01:00
netlink.h netlink: add tracepoint at NL_SET_ERR_MSG 2021-02-04 18:05:59 -08:00
nilfs2.h
nmi.h
objagg.h
oom.h
page_isolation.h
page_pool.h page_pool: Add API to update numa node 2019-11-20 11:47:36 -08:00
page_ref.h
pagemap.h mm/swap.c: don't pass "enum lru_list" to trace_mm_lru_insertion() 2021-02-24 13:38:33 -08:00
percpu.h
power_cpu_migrate.h
power.h PM: QoS: Simplify definitions of CPU latency QoS trace events 2020-02-13 11:26:39 +01:00
preemptirq.h tracing: Change offset type to s32 in preempt/irq tracepoints 2020-01-03 11:34:37 -05:00
printk.h
pwc.h media: usb: pwc: Introduce TRACE_EVENTs for pwc_isoc_handler() 2019-01-16 11:15:11 -05:00
pwm.h pwm: Implement tracing for .get_state() and .apply_state() 2020-01-20 12:28:37 +01:00
qdisc.h net_sched: add a tracepoint for qdisc creation 2020-05-27 15:05:49 -07:00
qla.h scsi: qla2xxx: Suppress two recently introduced compiler warnings 2020-05-19 21:43:01 -04:00
qrtr.h net: qrtr: Add tracepoint support 2020-04-22 12:55:54 -07:00
random.h random: remove dead code left over from blocking pool 2021-04-02 18:28:12 +11:00
rcu.h rcu/tree: Add a trace event for RCU CPU stall warnings 2021-03-15 13:53:24 -07:00
rdma_core.h RDMA/core: Add trace points to follow MR allocation 2020-01-07 16:10:53 -04:00
rdma.h RDMA/core: Move the rdma_show_ib_cm_event() macro 2020-08-24 16:01:47 -03:00
regulator.h regulator: core: Add regulator bypass trace points 2020-05-29 17:17:02 +01:00
rpcgss.h SUNRPC: Augment server-side rpcgss tracepoints 2020-07-13 17:28:24 -04:00
rpcrdma.h rpcrdma: Capture bytes received in Receive completion tracepoints 2021-02-05 11:16:56 -05:00
rpm.h PM-runtime: add tracepoints for usage_count changes 2020-01-13 12:28:29 +01:00
rseq.h
rtc.h
rxrpc.h rxrpc: Fix a missing NULL-pointer check in a trace 2020-09-14 16:18:59 +01:00
sched.h kthread: remove comments about old _do_fork() helper 2021-01-11 15:11:56 +01:00
scmi.h firmware: arm_scmi: Use signed integer to report transfer status 2020-06-30 14:07:08 +01:00
scsi.h
sctp.h sctp: move trace_sctp_probe_path into sctp_outq_sack 2019-12-26 13:06:45 -08:00
signal.h
siox.h
skb.h
smbus.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
sock.h tcp: Define IPPROTO_MPTCP 2020-01-09 18:41:41 -08:00
spi.h spi/trace: Cap buffer contents at 64 bytes 2019-05-02 10:37:52 +09:00
spmi.h
sunrpc.h SUNRPC: Export svc_xprt_received() 2021-03-22 13:22:13 -04:00
sunvnet.h
swiotlb.h
syscalls.h syscalls: Remove start and number from syscall_get_arguments() args 2019-04-05 09:26:43 -04:00
target.h scsi: target: core: Add CONTROL field for trace events 2020-10-02 18:36:19 -04:00
task.h
tcp.h net: tracepoint: exposing sk_family in all tcp:tracepoints 2021-02-04 09:25:36 -08:00
tegra_apb_dma.h tracing: Fix header include guards in trace event headers 2019-07-30 21:49:06 -04:00
thermal_power_allocator.h
thermal.h thermal: devfreq_cooling: change tracing function and arguments 2020-12-11 14:10:44 +01:00
thp.h
timer.h y2038: syscall implementation cleanups 2019-12-01 14:00:59 -08:00
tlb.h
udp.h
ufs.h scsi: ufs: Add exception event tracepoint 2021-03-04 17:36:58 -05:00
v4l2.h media: v4l2: abstract timeval handling in v4l2_buffer 2020-01-03 15:43:35 +01:00
vb2.h
vmscan.h mm/page_alloc: integrate classzone_idx and high_zoneidx 2020-06-03 20:09:44 -07:00
vsock_virtio_transport_common.h
wbt.h bdi: use bdi_dev_name() to get device name 2020-05-09 16:07:39 -06:00
workqueue.h workqueue/tracing: Copy workqueue name to buffer in trace event 2021-03-18 12:57:37 -04:00
writeback.h Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2020-12-04 07:48:12 -08:00
xdp.h bpf, xdp: Restructure redirect actions 2021-03-10 01:06:34 +01:00
xen.h x86/mm/tlb: Flush remote and local TLBs concurrently 2021-03-06 12:59:10 +01:00