mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-15 08:14:15 +08:00
a7800aa80e
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allowed guest protections. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and a readable one for itself, but that's suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn't support creating a 1GiB guest mapping unless userspace also has a 1GiB mapping. Decoupling the mapping sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides a clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g. poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
107 lines
2.0 KiB
Plaintext
# SPDX-License-Identifier: GPL-2.0
# KVM common configuration items and defaults

config HAVE_KVM
	bool

config HAVE_KVM_PFNCACHE
	bool

config HAVE_KVM_IRQCHIP
	bool

config HAVE_KVM_IRQFD
	bool

config HAVE_KVM_IRQ_ROUTING
	bool

config HAVE_KVM_DIRTY_RING
	bool

# Only strongly ordered architectures can select this, as it doesn't
# put any explicit constraint on userspace ordering. They can also
# select the _ACQ_REL version.
config HAVE_KVM_DIRTY_RING_TSO
	bool
	select HAVE_KVM_DIRTY_RING
	depends on X86

# Weakly ordered architectures can only select this, advertising
# to userspace the additional ordering requirements.
config HAVE_KVM_DIRTY_RING_ACQ_REL
	bool
	select HAVE_KVM_DIRTY_RING

# Allow enabling both the dirty bitmap and dirty ring. Only architectures
# that need to dirty memory outside of a vCPU context should select this.
config NEED_KVM_DIRTY_RING_WITH_BITMAP
	bool
	depends on HAVE_KVM_DIRTY_RING

config HAVE_KVM_EVENTFD
	bool
	select EVENTFD

config KVM_MMIO
	bool

config KVM_ASYNC_PF
	bool

# Toggle to switch between direct notification and batch job
config KVM_ASYNC_PF_SYNC
	bool

config HAVE_KVM_MSI
	bool

config HAVE_KVM_CPU_RELAX_INTERCEPT
	bool

config KVM_VFIO
	bool

config HAVE_KVM_INVALID_WAKEUPS
	bool

config KVM_GENERIC_DIRTYLOG_READ_PROTECT
	bool

config KVM_COMPAT
	def_bool y
	depends on KVM && COMPAT && !(S390 || ARM64 || RISCV)

config HAVE_KVM_IRQ_BYPASS
	bool

config HAVE_KVM_VCPU_ASYNC_IOCTL
	bool

config HAVE_KVM_VCPU_RUN_PID_CHANGE
	bool

config HAVE_KVM_NO_POLL
	bool

config KVM_XFER_TO_GUEST_WORK
	bool

config HAVE_KVM_PM_NOTIFIER
	bool

config KVM_GENERIC_HARDWARE_ENABLING
	bool

config KVM_GENERIC_MMU_NOTIFIER
	select MMU_NOTIFIER
	bool

config KVM_GENERIC_MEMORY_ATTRIBUTES
	select KVM_GENERIC_MMU_NOTIFIER
	bool

config KVM_PRIVATE_MEM
	select XARRAY_MULTI
	bool
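
Most symbols in this file are bools without a prompt, so they do not appear in menuconfig and are normally enabled only when an architecture's KVM Kconfig selects them. The fragment below is a minimal sketch of what such a selection might look like for a weakly ordered architecture that wants guest_memfd support. KVM_EXAMPLE_ARCH and its prompt text are hypothetical and exist only for illustration; they are not part of this commit or of any in-tree architecture.

```kconfig
# Hypothetical architecture-level option (illustration only, not in-tree).
config KVM_EXAMPLE_ARCH
	bool "Example KVM support for a weakly ordered architecture"
	# Weakly ordered, so advertise the acquire/release dirty ring variant;
	# this in turn selects HAVE_KVM_DIRTY_RING.
	select HAVE_KVM_DIRTY_RING_ACQ_REL
	# Pulls in KVM_GENERIC_MMU_NOTIFIER (and MMU_NOTIFIER) via select.
	select KVM_GENERIC_MEMORY_ATTRIBUTES
	# Builds guest_memfd support; this in turn selects XARRAY_MULTI.
	select KVM_PRIVATE_MEM
```

Note that `select` forces a symbol on without checking its `depends on` line, which is why this sketch avoids HAVE_KVM_DIRTY_RING_TSO: that symbol depends on X86.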