linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-27 04:54:41 +08:00

Author	SHA1	Message	Date
Emanuele Giuseppe Esposito	7907160dbf	KVM: nSVM: rename nested_load_control_from_vmcb12 in nested_copy_vmcb_control_to_cache Following the same naming convention of the previous patch, rename nested_load_control_from_vmcb12. In addition, inline copy_vmcb_control_area as it is only called by this function. __nested_copy_vmcb_control_to_cache() works with vmcb_control_area parameters and it will be useful in next patches, when we use local variables instead of svm cached state. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-4-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:39 -05:00
Emanuele Giuseppe Esposito	f2740a8d85	KVM: nSVM: introduce svm->nested.save to cache save area before checks This is useful in the next patch, to keep a saved copy of vmcb12 registers and pass it around more easily. Instead of blindly copying everything, we just copy EFER, CR0, CR3, CR4, DR6 and DR7 which are needed by the VMRUN checks. If more fields will need to be checked, it will be quite obvious to see that they must be added in struct vmcb_save_area_cached and in nested_copy_vmcb_save_to_cache(). __nested_copy_vmcb_save_to_cache() takes a vmcb_save_area_cached parameter, which is useful in order to save the state to a local variable. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-3-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:38 -05:00
Emanuele Giuseppe Esposito	907afa48e9	KVM: nSVM: move nested_vmcb_check_cr3_cr4 logic in nested_vmcb_valid_sregs Inline nested_vmcb_check_cr3_cr4 as it is not called by anyone else. Doing so simplifies next patches. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-2-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:38 -05:00
Maciej S. Szmigiero	f4209439b5	KVM: Optimize gfn lookup in kvm_zap_gfn_range() Introduce a memslots gfn upper bound operation and use it to optimize kvm_zap_gfn_range(). This way this handler can do a quick lookup for intersecting gfns and won't have to do a linear scan of the whole memslot set. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:35 -05:00
Maciej S. Szmigiero	a54d806688	KVM: Keep memslots in tree-based structures instead of array-based ones The current memslot code uses a (reverse gfn-ordered) memslot array for keeping track of them. Because the memslot array that is currently in use cannot be modified every memslot management operation (create, delete, move, change flags) has to make a copy of the whole array so it has a scratch copy to work on. Strictly speaking, however, it is only necessary to make copy of the memslot that is being modified, copying all the memslots currently present is just a limitation of the array-based memslot implementation. Two memslot sets, however, are still needed so the VM continues to run on the currently active set while the requested operation is being performed on the second, currently inactive one. In order to have two memslot sets, but only one copy of actual memslots it is necessary to split out the memslot data from the memslot sets. The memslots themselves should be also kept independent of each other so they can be individually added or deleted. These two memslot sets should normally point to the same set of memslots. They can, however, be desynchronized when performing a memslot management operation by replacing the memslot to be modified by its copy. After the operation is complete, both memslot sets once again point to the same, common set of memslot data. This commit implements the aforementioned idea. For tracking of gfns an ordinary rbtree is used since memslots cannot overlap in the guest address space and so this data structure is sufficient for ensuring that lookups are done quickly. The "last used slot" mini-caches (both per-slot set one and per-vCPU one), that keep track of the last found-by-gfn memslot, are still present in the new code. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:34 -05:00
Maciej S. Szmigiero	ed922739c9	KVM: Use interval tree to do fast hva lookup in memslots The current memslots implementation only allows quick binary search by gfn, quick lookup by hva is not possible - the implementation has to do a linear scan of the whole memslots array, even though the operation being performed might apply just to a single memslot. This significantly hurts performance of per-hva operations with higher memslot counts. Since hva ranges can overlap between memslots an interval tree is needed for tracking them. [sean: handle interval tree updates in kvm_replace_memslot()] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <d66b9974becaa9839be9c4e1a5de97b177b4ac20.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:32 -05:00
Maciej S. Szmigiero	f5756029ee	KVM: x86: Use nr_memslot_pages to avoid traversing the memslots array There is no point in recalculating from scratch the total number of pages in all memslots each time a memslot is created or deleted. Use KVM's cached nr_memslot_pages to compute the default max number of MMU pages. Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can possibly overflow an unsigned long variable. Write this "* 20 / 1000" operation as "/ 50" instead to avoid such overflow. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: use common KVM field and rework changelog accordingly] Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:29 -05:00
Maciej S. Szmigiero	e0c2b6338a	KVM: x86: Don't call kvm_mmu_change_mmu_pages() if the count hasn't changed There is no point in calling kvm_mmu_change_mmu_pages() for memslot operations that don't change the total page count, so do it just for KVM_MR_CREATE and KVM_MR_DELETE. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <9e56b7616a11f5654e4ab486b3237366b7ba9f2a.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:28 -05:00
Sean Christopherson	77aedf26fe	KVM: x86: Don't assume old/new memslots are non-NULL at memslot commit Play nice with a NULL @old or @new when handling memslot updates so that common KVM can pass NULL for one or the other in CREATE and DELETE cases instead of having to synthesize a dummy memslot. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <2eb7788adbdc2bc9a9c5f86844dd8ee5c8428732.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:26 -05:00
Sean Christopherson	6a99c6e3f5	KVM: Stop passing kvm_userspace_memory_region to arch memslot hooks Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now that its use has been removed in all architectures. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:25 -05:00
Sean Christopherson	9d7d18ee3f	KVM: x86: Use "new" memslot instead of userspace memory region Get the number of pages directly from the new memslot instead of computing the same from the userspace memory region when allocating memslot metadata. This will allow a future patch to drop @mem. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <ef44892eb615f5c28e682bbe06af96aff9ce2a9f.1638817639.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:23 -05:00
Sean Christopherson	537a17b314	KVM: Let/force architectures to deal with arch specific memslot data Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch code to handle propagating arch specific data from "new" to "old" when necessary. This is a baby step towards dynamically allocating "new" from the get go, and is a (very) minor performance boost on x86 due to not unnecessarily copying arch data. For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE and FLAGS_ONLY. This is functionally a nop as the previous behavior would overwrite the pointer for CREATE, and eventually discard/ignore it for DELETE. For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV, x86 needs to reallocate arch data in the MOVE case as the size of x86's allocations depend on the alignment of the memslot's gfn. Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to match the "commit" prototype. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [mss: add missing RISCV kvm_arch_prepare_memory_region() change] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:20 -05:00
Marc Zyngier	46808a4cb8	KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index Everywhere we use kvm_for_each_vcpu(), we use an int as the vcpu index. Unfortunately, we're about to rework the iterator, which requires this to be upgraded to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:15 -05:00
Marc Zyngier	27592ae8db	KVM: Move wiping of the kvm->vcpus array to common code All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there are no further references on the kvm structure. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-2-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:13 -05:00
Tom Lendacky	ad5b353240	KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT exit code or exit parameters fails. The VMGEXIT instruction can be issued from userspace, even though userspace (likely) can't update the GHCB. To prevent userspace from being able to kill the guest, return an error through the GHCB when validation fails rather than terminating the guest. For cases where the GHCB can't be updated (e.g. the GHCB can't be mapped, etc.), just return back to the guest. The new error codes are documented in the latest update to the GHCB specification. Fixes: `291bd20d5d` ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:04 -05:00
Sean Christopherson	a655276a59	KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so that KVM falls back to __vmalloc() if physically contiguous memory isn't available. The buffer is purely a KVM software construct, i.e. there's no need for it to be physically contiguous. Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:03 -05:00
Sean Christopherson	75236f5f22	KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails Return appropriate error codes if setting up the GHCB scratch area for an SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM when allocating the kernel buffer could be confusing as userspace would likely suspect a guest issue. Fixes: `8f423a80d2` ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109222350.2266045-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-05 03:02:03 -05:00
Sean Christopherson	a955cad84c	KVM: x86/mmu: Retry page fault if root is invalidated by memslot update Bail from the page fault handler if the root shadow page was obsoleted by a memslot update. Do the check _after_ acquiring mmu_lock, as the TDP MMU doesn't rely on the memslot/MMU generation, and instead relies on the root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes mmu_lock for write. For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has moved past the gfn associated with the SP. For other MMUs, the resulting behavior is far more convoluted, though unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete root isn't directly problematic, as the obsolete root will be unloaded and dropped before the vCPU re-enters the guest. But because the legacy MMU tracks shadow pages by their role, any SP created by the fault can be reused in the new post-reload root. Again, that _shouldn't_ be problematic as any leaf child SPTEs will be created for the current/valid memslot generation, and kvm_mmu_get_page() will not reuse child SPs from the old generation as they will be flagged as obsolete. But, given that continuing with the fault is pointless (the root will be unloaded), apply the check to all MMUs. Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:12 -05:00
Dan Carpenter	bfbb307c62	KVM: VMX: Set failure code in prepare_vmcs02() The error paths in the prepare_vmcs02() function are supposed to set *entry_failure_code but this path does not. It leads to using an uninitialized variable in the caller. Fixes: `71f7347025` ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Message-Id: <20211130125337.GB24578@kili> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:11 -05:00
Paolo Bonzini	ef8b4b7203	KVM: ensure APICv is considered inactive if there is no APIC kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel local APIC, however kvm_apicv_activated might still be true if there are no reasons to disable APICv; in fact it is quite likely that there is none because APICv is inhibited by specific configurations of the local APIC and those configurations cannot be programmed. This triggers a WARN: WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); To avoid this, introduce another cause for APICv inhibition, namely the absence of an in-kernel local APIC. This cause is enabled by default, and is dropped by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. Reported-by: Ignat Korchagin <ignat@cloudflare.com> Fixes: `ee49a89329` ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22) Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Ignat Korchagin <ignat@cloudflare.com> Message-Id: <20211130123746.293379-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:11 -05:00
Like Xu	cb1d220da0	KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register If we run the following perf command in an AMD Milan guest: perf stat \ -e cpu/event=0x1d0/ \ -e cpu/event=0x1c7/ \ -e cpu/umask=0x1f,event=0x18e/ \ -e cpu/umask=0x7,event=0x18e/ \ -e cpu/umask=0x18,event=0x18e/ \ ./workload dmesg will report a #GP warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx. This is because according to APM (Revision: 4.03) Figure 13-7, bits [35:32] of the AMD PerfEvtSeln register are part of the event select encoding, which extends the EVENT_SELECT field from 8 bits to 12 bits. Opportunistically update pmu->reserved_bits for reserved bit 19. Reported-by: Jim Mattson <jmattson@google.com> Fixes: `ca724305a2` ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20211118130320.95997-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:11:50 -05:00
Paolo Bonzini	7cfc5c653b	KVM: fix avic_set_running for preemptable kernels avic_set_running() passes the current CPU to avic_vcpu_load(), albeit via vcpu->cpu rather than smp_processor_id(). If the thread is migrated while avic_set_running runs, the call to avic_vcpu_load() can use a stale value for the processor id. Avoid this by blocking preemption over the entire execution of avic_set_running(). Reported-by: Sean Christopherson <seanjc@google.com> Fixes: `8221c13700` ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 07:40:48 -05:00
Paolo Bonzini	e90e51d5f0	KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled There is nothing to synchronize if APICv is disabled, since neither other vCPUs nor assigned devices can set PIR.ON. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 07:40:47 -05:00
Paolo Bonzini	c9d61dcb0b	KVM: SEV: accept signals in sev_lock_two_vms Generally, kvm->lock is not taken for a long time, but sev_lock_two_vms is different: it takes vCPU locks inside, so userspace can hold it back just by calling a vCPU ioctl. Play it safe and use mutex_lock_killable. Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:15 -05:00
Paolo Bonzini	10a37929ef	KVM: SEV: do not take kvm->lock when destroying Taking the lock is useless since there are no other references, and there are already accesses (e.g. to sev->enc_context_owner) that do not take it. So get rid of it. Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:14 -05:00
Paolo Bonzini	17d44a96f0	KVM: SEV: Prohibit migration of a VM that has mirrors VMs that mirror an encryption context rely on the owner to keep the ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would cause a dangling ASID: 1. copy context from A to B (gets ref to A) 2. move context from A to L (moves ASID from A to L) 3. close L (releases ASID from L, B still references it) The right way to do the handoff instead is to create a fresh mirror VM on the destination first: 1. copy context from A to B (gets ref to A) [later] 2. close B (releases ref to A) 3. move context from A to L (moves ASID from A to L) 4. copy context from L to M So, catch the situation by adding a count of how many VMs are mirroring this one's encryption context. Fixes: `0b020f5af0` ("KVM: SEV: Add support for SEV-ES intra host migration") Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:14 -05:00
Paolo Bonzini	bf42b02b19	KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked Now that we have a facility to lock two VMs with deadlock protection, use it for the creation of mirror VMs as well. One of COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst) would always fail, so the combination is nonsensical and it is okay to return -EBUSY if it is attempted. This sidesteps the question of what happens if a VM is MOVE_ENC_CONTEXT_FROM'd at the same time as it is COPY_ENC_CONTEXT_FROM'd: the locking prevents that from happening. Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:13 -05:00
Paolo Bonzini	642525e3bd	KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM Allow intra-host migration of a mirror VM; the destination VM will be a mirror of the same ASID as the source. Fixes: `b56639318b` ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:12 -05:00
Paolo Bonzini	2b347a3878	KVM: SEV: initialize regions_list of a mirror VM This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM, but technically harmless because the region list was unused for a mirror VM. However, it is untidy and it now causes a NULL pointer access when attempting to move the encryption context of a mirror VM. Fixes: `54526d1fd5` ("KVM: x86: Support KVM VMs sharing SEV context") Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:12 -05:00
Paolo Bonzini	501b580c02	KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM Encapsulate the handling of the migration_in_progress flag for both VMs in two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit later, and this change 1) keeps the cleanup chain of labels smaller 2) makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic. Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:11 -05:00
Paolo Bonzini	4674164f0a	KVM: SEV: do not use list_replace_init on an empty list list_replace_init cannot be used if the source is an empty list, because "new->next->prev = new" will overwrite "old->next": new old prev = new, next = new prev = old, next = old new->next = old->next prev = new, next = old prev = old, next = old new->next->prev = new prev = new, next = old prev = old, next = new new->prev = old->prev prev = old, next = old prev = old, next = old new->next->prev = new prev = old, next = old prev = new, next = new The desired outcome instead would be to leave both old and new the same as they were (two empty circular lists). Use list_cut_before, which already has the necessary check and is documented to discard the previous contents of the list that will hold the result. Fixes: `b56639318b` ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:11 -05:00
Paolo Bonzini	53b7ca1a35	KVM: x86: Use a stable condition around all VT-d PI paths Currently, checks for whether VT-d PI can be used refer to the current status of the feature in the current vCPU; or they more or less pick vCPU 0 in case a specific vCPU is not available. However, these checks do not attempt to synchronize with changes to the IRTE. In particular, there is no path that updates the IRTE when APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU that has APICv disabled, if the wakeup occurs because of an IRTE that points to a posted interrupt. To fix this, always go through the VT-d PI path as long as there are assigned devices and APICv is available on both the host and the VM side. Since the relevant condition was copied over three times, take the hint and factor it into a separate function. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:53:14 -05:00
Paolo Bonzini	37c4dbf337	KVM: x86: check PIR even for vCPUs with disabled APICv The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even if APICv is disabled on the vCPU that receives it. In that case, the interrupt will just cause a vmexit and leave the ON bit set together with the PIR bit corresponding to the interrupt. Right now, the interrupt would not be delivered until APICv is re-enabled. However, fixing this is just a matter of always doing the PIR->IRR synchronization, even if the vCPU has temporarily disabled APICv. This is not a problem for performance, or if anything it is an improvement. First, in the common case where vcpu->arch.apicv_active is true, one fewer check has to be performed. Second, static_call_cond will elide the function call if APICv is not present or disabled. Finally, in the case for AMD hardware we can remove the sync_pir_to_irr callback: it is only needed for apic_has_interrupt_for_ppr, and that function already has a fallback for !APICv. Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:52:39 -05:00
Paolo Bonzini	7e1901f6c8	KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled If APICv is disabled for this vCPU, assigned devices may still attempt to post interrupts. In that case, we need to cancel the vmentry and deliver the interrupt with KVM_REQ_EVENT. Extend the existing code that handles injection of L1 interrupts into L2 to cover this case as well. vmx_hwapic_irr_update is only called when APICv is active so it would be confusing to add a check for vcpu->arch.apicv_active in there. Instead, just use vmx_set_rvi directly in vmx_sync_pir_to_irr. Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:51:58 -05:00
Sean Christopherson	f47491d7f3	KVM: x86/mmu: Handle "default" period when selectively waking kthread Account for the '0' being a default, "let KVM choose" period, when determining whether or not the recovery worker needs to be awakened in response to userspace reducing the period. Failure to do so results in the worker not being awakened properly, e.g. when changing the period from '0' to any small-ish value. Fixes: `4dfe4f40d8` ("kvm: x86: mmu: Make NX huge page recovery period configurable") Cc: stable@vger.kernel.org Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120015706.3830341-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:27 -05:00
Paolo Bonzini	28f091bc2f	KVM: MMU: shadow nested paging does not have PKU Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding incorrect interpretations of the nested hypervisor's page tables. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:26 -05:00
Sean Christopherson	4b85c921cd	KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path Drop the "flush" param and return values to/from the TDP MMU's helper for zapping collapsible SPTEs. Because the helper runs with mmu_lock held for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic zap handles the necessary remote TLB flush. Similarly, because mmu_lock is dropped and re-acquired between zapping legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must handle remote TLB flushes from the legacy MMU before calling into the TDP MMU. Fixes: `e2209710cc` ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:25 -05:00
Sean Christopherson	7533377215	KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping Use the yield-safe variant of the TDP MMU iterator when handling an unmapping event from the MMU notifier, as most occurrences of the event allow yielding. Fixes: `e1eed5847b` ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120015008.3780032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:25 -05:00
Lai Jiangshan	05b29633c7	KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg() INVLPG operates on guest virtual addresses, which are represented by vcpu->arch.walk_mmu. In nested virtualization scenarios, kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were emulated by L0 (in practice, this hardly ever happens) when nested two-dimensional paging is enabled, the call to ->tlb_flush_gva() would be skipped and the hardware TLB entry would not be invalidated. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:21 -05:00
Lai Jiangshan	12ec33a705	KVM: X86: Fix when shadow_root_level=5 && guest root_level<4 If there is an L1 with nNPT in 32bit, the shadow walk starts with pae_root. Fixes: `a717a780fc` ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:20 -05:00
Vitaly Kuznetsov	feb627e8d6	KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN Commit `63f5a1909f` ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls after first successful KVM_RUN and promised to make this sequence forbidden in 5.16. It's time to fulfil the promise. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211122175818.608220-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:20 -05:00
Sean Christopherson	712494de96	KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12 Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12, i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's TLB entries for vpid02. From L1's perspective, changing L2's VPID is effectively a TLB flush unless "hardware" has previously cached entries for the new vpid12. Because KVM tracks only a single vpid12, KVM doesn't know if the new vpid12 has been used in the past and so must treat it as a brand new, never been used VPID, i.e. must assume that the new vpid12 represents a TLB flush from L1's perspective. For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12 and thus can't have cached entries in the TLB for vpid12. Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com> Fixes: `5c614b3583` ("KVM: nVMX: nested VPID emulation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211125014944.536398-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 07:11:29 -05:00
Sean Christopherson	40e5f90804	KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if there's a pending flush when a nested VM-Exit occurs, then the flush was requested in the context of L2 and needs to be handled before switching to L1, otherwise the flush for L2 would effectively be lost. Opportunistically add a helper to handle CURRENT and GUEST as a pair, the logic for when they need to be serviced is identical as both requests are tied to L1 vs. L2, the only difference is the scope of the flush. Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com> Fixes: `07ffaf343e` ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211125014944.536398-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 07:07:49 -05:00
Sean Christopherson	2b4a5a5d56	KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of always flushing vpid01. Any TLB flush that is triggered when L2 is active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB flush needs to be applied to L2's VPID. Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com> Fixes: `07ffaf343e` ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211125014944.536398-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 07:06:58 -05:00
Paolo Bonzini	30d7c5d60a	KVM: SEV: expose KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION. Fixes: `b56639318b` ("KVM: SEV: Add support for SEV intra host migration") Cc: Peter Gonda <pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 06:43:30 -05:00
Paolo Bonzini	78311a5140	KVM: x86: ignore APICv if LAPIC is not enabled Synchronize the two calls to kvm_x86_sync_pir_to_irr. The one in the reenter-guest fast path invoked the callback unconditionally even if LAPIC is present but disabled. In this case, there are no interrupts to deliver, and therefore posted interrupts can be ignored. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 06:43:29 -05:00
Paolo Bonzini	8503fea676	KVM: VMX: do not use uninitialized gfn_to_hva_cache An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes the accessors to croak very loudly. While a BUG_ON is definitely _too_ loud and a bug on its own, there is indeed an issue of using the caches in such a way that they could not have been initialized, because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init would not be called. For the vmcs12_cache, the solution is simply to invoke kvm_gfn_to_hva_cache_init unconditionally: we already know that the cache does not match the current VMCS pointer. For the shadow_vmcs12_cache, there is no similar condition that checks the VMCS link pointer, so invalidate the cache on VMXON. Fixes: `cee66664dc` ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld") Acked-by: David Woodhouse <dwmw@amazon.co.uk> Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 06:43:28 -05:00
Hou Wenlong	8ed716ca7d	KVM: x86/mmu: Pass parameter flush as false in kvm_tdp_mmu_zap_collapsible_sptes() Since the TLB flush has already been done for the legacy MMU before kvm_tdp_mmu_zap_collapsible_sptes(), the flush parameter should be false for kvm_tdp_mmu_zap_collapsible_sptes(). Fixes: `e2209710cc` ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 07:05:58 -05:00
Hou Wenlong	c7785d85b6	KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range() If the flush parameter is set, zap_gfn_range() flushes remote TLBs when it yields, so no TLB flush is needed outside. So use the return value of zap_gfn_range() directly instead of OR on it in kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(). Fixes: `3039bcc744` ("KVM: Move x86's MMU notifier memslot walkers to generic code") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 07:05:57 -05:00
Juergen Gross	9dba4d24cb	x86/kvm: remove unused ack_notifier callbacks Commit `f52447261b` ("KVM: irq ack notification") introduced an ack_notifier() callback in struct kvm_pic and in struct kvm_ioapic without using them anywhere. Remove those callbacks again. Signed-off-by: Juergen Gross <jgross@suse.com> Message-Id: <20211117071617.19504-1-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 07:05:57 -05:00
Ben Gardon	574c3c55e9	KVM: x86/mmu: Fix TLB flush range when handling disconnected pt When recursively clearing out disconnected pts, the range based TLB flush in handle_removed_tdp_mmu_page uses the wrong starting GFN, resulting in the flush mostly missing the affected range. Fix this by using base_gfn for the flush. In response to feedback from David Matlack on the RFC version of this patch, also move a few definitions into the for loop in the function to prevent unintended references to them in the future. Fixes: `a066e61f13` ("KVM: x86/mmu: Factor out handling of removed page tables") CC: stable@vger.kernel.org Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115211704.2621644-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:15:19 -05:00
Vitaly Kuznetsov	2845e7353b	KVM: x86: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS It doesn't make sense to return the recommended maximum number of vCPUs which exceeds the maximum possible number of vCPUs. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211116163443.88707-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:12:15 -05:00
Tom Lendacky	b5aead0064	KVM: x86: Assume a 64-bit hypercall for guests with protected state When processing a hypercall for a guest with protected state, currently SEV-ES guests, the guest CS segment register can't be checked to determine if the guest is in 64-bit mode. For an SEV-ES guest, it is expected that communication between the guest and the hypervisor is performed to shared memory using the GHCB. In order to use the GHCB, the guest must have been in long mode, otherwise writes by the guest to the GHCB would be encrypted and not be able to be comprehended by the hypervisor. Create a new helper function, is_64_bit_hypercall(), that assumes the guest is in 64-bit mode when the guest has protected state, and returns true, otherwise invoking is_64_bit_mode() to determine the mode. Update the hypercall related routines to use is_64_bit_hypercall() instead of is_64_bit_mode(). Add a WARN_ON_ONCE() to is_64_bit_mode() to catch occurrences of calls to this helper function for a guest running with protected state. Fixes: `f1c6366e30` ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <e0b20c770c9d0d1403f23d83e785385104211f74.1621878537.git.thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:12:13 -05:00
Paolo Bonzini	817506df9d	Merge branch 'kvm-5.16-fixes' into kvm-master * Fixes for Xen emulation * Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache * Fixes for migration of 32-bit nested guests on 64-bit hypervisor * Compilation fixes * More SEV cleanups	2021-11-18 02:11:57 -05:00
Sean Christopherson	8e38e96a4e	KVM: SEV: Fix typo in and tweak name of cmd_allowed_from_miror() Rename cmd_allowed_from_miror() to is_cmd_allowed_from_mirror(), fixing a typo and making it obvious that the result is a boolean where false means "not allowed". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109215101.2211373-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:10:28 -05:00
Sean Christopherson	ea410ef4da	KVM: SEV: Drop a redundant setting of sev->asid during initialization Remove a fully redundant write to sev->asid during SEV/SEV-ES guest initialization. The ASID is set a few lines earlier prior to the call to sev_platform_init(), which doesn't take "sev" as a param, i.e. can't muck with the ASID barring some truly magical behind-the-scenes code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109215101.2211373-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:10:27 -05:00
Sean Christopherson	1bd00a4257	KVM: SEV: WARN if SEV-ES is marked active but SEV is not WARN if the VM is tagged as SEV-ES but not SEV. KVM relies on SEV and SEV-ES being set atomically, and guards common flows with "is SEV", i.e. observing SEV-ES without SEV means KVM has a fatal bug. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109215101.2211373-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:10:27 -05:00
Sean Christopherson	a41fb26e61	KVM: SEV: Set sev_info.active after initial checks in sev_guest_init() Set sev_info.active during SEV/SEV-ES activation before calling any code that can potentially consume sev_info.es_active, e.g. set "active" and "es_active" as a pair immediately after the initial sanity checks. KVM generally expects that es_active can be true if and only if active is true, e.g. sev_asid_new() deliberately avoids sev_es_guest() so that it doesn't get a false negative. This will allow WARNing in sev_es_guest() if the VM is tagged as SEV-ES but not SEV. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109215101.2211373-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:10:27 -05:00
Sean Christopherson	79b1114276	KVM: SEV: Disallow COPY_ENC_CONTEXT_FROM if target has created vCPUs Reject COPY_ENC_CONTEXT_FROM if the destination VM has created vCPUs. KVM relies on SEV activation to occur before vCPUs are created, e.g. to set VMCB flags and intercepts correctly. Fixes: `54526d1fd5` ("KVM: x86: Support KVM VMs sharing SEV context") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Nathan Tempelman <natet@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109215101.2211373-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:10:27 -05:00
David Woodhouse	cee66664dc	KVM: nVMX: Use a gfn_to_hva_cache for vmptrld And thus another call to kvm_vcpu_map() can die. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-7-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:43 -05:00
David Woodhouse	7d0172b3ca	KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check Kill another mostly gratuitous kvm_vcpu_map() which could just use the userspace HVA for it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-6-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:43 -05:00
David Woodhouse	6a834754a5	KVM: x86/xen: Use sizeof_field() instead of open-coding it Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:43 -05:00
David Woodhouse	297d597a6d	KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12 Using kvm_vcpu_map() for reading from the guest is entirely gratuitous, when all we do is a single memcpy and unmap it again. Fix it up to use kvm_read_guest()... but in fact I couldn't bring myself to do that without also making it use a gfn_to_hva_cache for both that and the copy in the other direction. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:42 -05:00
David Woodhouse	4e8436479a	KVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO In commit `319afe6856` ("KVM: xen: do not use struct gfn_to_hva_cache") we stopped storing this in-kernel as a GPA, and started storing it as a GFN. Which means we probably should have stopped calling gpa_to_gfn() on it when userspace asks for it back. Cc: stable@vger.kernel.org Fixes: `319afe6856` ("KVM: xen: do not use struct gfn_to_hva_cache") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:42 -05:00
Maxim Levitsky	b8453cdcf2	KVM: x86/mmu: include EFER.LMA in extended mmu role Incorporate EFER.LMA into kvm_mmu_extended_role, as it is used to compute the guest root level and is not reflected in kvm_mmu_page_role.level when TDP is in use. When simply running the guest, it is impossible for EFER.LMA and kvm_mmu.root_level to get out of sync, as the guest cannot transition from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without first bouncing through a different MMU context. And stuffing guest state via KVM_SET_SREGS{,2} also ensures a full MMU context reset. However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to set guest state when migrating the VM while L2 is active, the vCPU state will reflect L2, not L1. If L1 is using TDP for L2, then root_mmu will have been configured using L2's state, despite not being used for L2. If L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will be configured for guest PAE paging, but will match the mmu_role for 64-bit paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit. Alternatively, the root_mmu's role could be invalidated after a successful KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu, i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily tricky, and not even needed if L1 and L2 do have the same role (e.g., they are both 64-bit guests and run with the same CR4). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:42 -05:00
Maxim Levitsky	af957eebfc	KVM: nVMX: don't use vcpu->arch.efer when checking host state on nested state load When loading nested state, don't use vcpu->arch.efer to get the L1 host's 64-bit vs. 32-bit state and don't check it for consistency with respect to VM_EXIT_HOST_ADDR_SPACE_SIZE, as register state in vCPU may be stale when KVM_SET_NESTED_STATE is called---and architecturally does not exist. When restoring L2 state in KVM, the CPU is placed in non-root where nested VMX code has no snapshot of L1 host state: VMX (conditionally) loads host state fields loaded on VM-exit, but they need not correspond to the state before entry. A simple case occurs in KVM itself, where the host RIP field points to vmx_vmexit rather than the instruction following vmlaunch/vmresume. However, for the particular case of L1 being in 32- or 64-bit mode on entry, the exit controls can be treated instead as the source of truth regarding the state of L1 on entry, and can be used to check that vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE matches vmcs12.HOST_EFER if vmcs12.VM_EXIT_LOAD_IA32_EFER is set. The consistency check on CPU EFER vs. vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE, instead, happens only on VM-Enter. That's because, again, there's conceptually no "current" L1 EFER to check on KVM_SET_NESTED_STATE. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211115131837.195527-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:42 -05:00
David Woodhouse	964b7aa0b0	KVM: Fix steal time asm constraints In 64-bit mode, x86 instruction encoding allows us to use the low 8 bits of any GPR as an 8-bit operand. In 32-bit mode, however, we can only use the [abcd] registers. For which, GCC has the "q" constraint instead of the less restrictive "r". Also fix st->preempted, which is an input/output operand rather than an input. Fixes: `7e2175ebd6` ("KVM: x86: Fix recording of guest steal time / preempted status") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <89bf72db1b859990355f9c40713a34e0d2d86c98.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:41 -05:00
Paul Durrant	dc23a5110b	cpuid: kvm_find_kvm_cpuid_features() should be declared 'static' The lack of a static declaration currently results in: arch/x86/kvm/cpuid.c:128:26: warning: no previous prototype for function 'kvm_find_kvm_cpuid_features' when compiling with "W=1". Reported-by: kernel test robot <lkp@intel.com> Fixes: `760849b147` ("KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES") Signed-off-by: Paul Durrant <pdurrant@amazon.com> Message-Id: <20211115144131.5943-1-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:14 -05:00
黄乐	c5adbb3af0	KVM: x86: Fix uninitialized eoi_exit_bitmap usage in vcpu_load_eoi_exitmap() In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is initialized only when Hyper-V context is available, in other paths it is just passed to kvm_x86_ops.load_eoi_exitmap() directly from the stack, which would cause unexpected interrupt delivery/handling issues, e.g. an old linux kernel that relies on PIT to do clock calibration on KVM might randomly fail to boot. Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V context is not available. Fixes: `f2bc14b69c` ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context") Cc: stable@vger.kernel.org Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Huang Le <huangle1@jd.com> Message-Id: <62115b277dab49ea97da5633f8522daf@jd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-16 07:51:04 -05:00
Paolo Bonzini	501cfe0679	KVM: SEV: unify cgroup cleanup code for svm_vm_migrate_from Use the same cleanup code independent of whether the cgroup to be uncharged and unref'd is the source or the destination cgroup. Use a bool to track whether the destination cgroup has been charged, which also fixes a bug in the error case: the destination cgroup must be uncharged only if it does not match the source. Fixes: `b56639318b` ("KVM: SEV: Add support for SEV intra host migration") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-12 04:37:51 -05:00
Paolo Bonzini	3e067fd850	KVM: x86: move guest_pv_has out of user_access section When UBSAN is enabled, the code emitted for the call to guest_pv_has includes a call to __ubsan_handle_load_invalid_value. objtool complains that this call happens with UACCESS enabled; to avoid the warning, pull the calls to user_access_begin into both arms of the "if" statement, after the check for guest_pv_has. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-12 02:53:52 -05:00
Paolo Bonzini	f5396f2d82	Merge branch 'kvm-5.16-fixes' into kvm-master * Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status * Fix selftests on APICv machines * Fix sparse warnings * Fix detection of KVM features in CPUID * Cleanups for bogus writes to MSR_KVM_PV_EOI_EN * Fixes and cleanups for MSR bitmap handling * Cleanups for INVPCID * Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures	2021-11-11 11:03:05 -05:00
Paolo Bonzini	1f05833193	Merge branch 'kvm-sev-move-context' into kvm-master Add support for AMD SEV and SEV-ES intra-host migration. Intra host migration provides a low-cost mechanism for userspace VMM upgrades. In the common case for intra host migration, we can rely on the normal ioctls for passing data from one VMM to the next. SEV, SEV-ES, and other confidential compute environments make most of this information opaque, and render KVM ioctls such as "KVM_GET_REGS" irrelevant. As a result, we need the ability to pass this opaque metadata from one VMM to the next. The easiest way to do this is to leave this data in the kernel, and transfer ownership of the metadata from one KVM VM (or vCPU) to the next. In-kernel hand off makes it possible to move any data that would be unsafe/impossible for the kernel to hand directly to userspace, and cannot be reproduced using data that can be handed to userspace. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 11:02:58 -05:00
Vitaly Kuznetsov	da1bfd52b9	KVM: x86: Drop arbitrary KVM_SOFT_MAX_VCPUS KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports either num_online_cpus() or num_present_cpus(), s390 has multiple constants depending on hardware features. On x86, KVM reports an arbitrary value of '710' which is supposed to be the maximum tested value but it's possible to test all KVM_MAX_VCPUS even when there are less physical CPUs available. Drop the arbitrary '710' value and return num_online_cpus() on x86 as well. The recommendation will match other architectures and will mean 'no CPU overcommit'. For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning when the requested vCPU number exceeds it. The static limit of '710' is quite weird as smaller systems with just a few physical CPUs should certainly "recommend" less. Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211111134733.86601-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:24 -05:00
Vipin Sharma	796c83c58a	KVM: Move INVPCID type check from vmx and svm to the common kvm_handle_invpcid() Handle #GP on INVPCID due to an invalid type in the common switch statement instead of relying on the callers (VMX and SVM) to manually validate the type. Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check the type before reading the operand from memory, so deferring the type validity check until after that point is architecturally allowed. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-3-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:24 -05:00
Vipin Sharma	329bd56ce5	KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2 field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that holds the invalidation type. Add a helper to retrieve reg2 from VMX instruction info to consolidate and document the shift+mask magic. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-2-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:24 -05:00
Sean Christopherson	a5e0c25284	KVM: nVMX: Clean up x2APIC MSR handling for L2 Clean up the x2APIC MSR bitmap interception code for L2, which is the last holdout of open coded bitmap manipulations. Freshen up the SDM/PRM comment, rename the function to make it abundantly clear the funky behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored (the previous comment was flat out wrong for x2APIC behavior). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:23 -05:00
Sean Christopherson	0cacb80b98	KVM: VMX: Macrofy the MSR bitmap getters and setters Add builder macros to generate the MSR bitmap helpers to reduce the amount of copy-paste code, especially with respect to all the magic numbers needed to calc the correct bit location. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:23 -05:00
Sean Christopherson	67f4b9969c	KVM: nVMX: Handle dynamic MSR intercept toggling Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2, and always update the relevant bits in vmcs02. This fixes two distinct, but intertwined bugs related to dynamic MSR bitmap modifications. The first issue is that KVM fails to enable MSR interception in vmcs02 for the FS/GS base MSRs if L1 first runs L2 with interception disabled, and later enables interception. The second issue is that KVM fails to honor userspace MSR filtering when preparing vmcs02. Fix both issues simultaneously as fixing only one of the issues (doesn't matter which) would create a mess that no one should have to bisect. Fixing only the first bug would exacerbate the MSR filtering issue as userspace would see inconsistent behavior depending on the whims of L1. Fixing only the second bug (MSR filtering) effectively requires fixing the first, as the nVMX code only knows how to transition vmcs02's bitmap from 1->0. Move the various accessor/mutators that are currently buried in vmx.c into vmx.h so that they can be shared by the nested code. Fixes: `1a155254ff` ("KVM: x86: Introduce MSR filtering") Fixes: `d69129b4e4` ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible") Cc: stable@vger.kernel.org Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:23 -05:00
Sean Christopherson	7dfbc624eb	KVM: nVMX: Query current VMCS when determining if MSR bitmaps are in use Check the current VMCS controls to determine if an MSR write will be intercepted due to MSR bitmaps being disabled. In the nested VMX case, KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or if KVM can't map L1's bitmaps for whatever reason. Note, the bad behavior is relatively benign in the current code base as KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and only if L0 KVM also disables interception of an MSR, and only uses the buggy helper for MSR_IA32_SPEC_CTRL. Because KVM explicitly tests WRMSR before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it isn't strictly necessary. Tag the fix for stable in case a future fix wants to use msr_write_intercepted(), in which case a buggy implementation in older kernels could prove subtly problematic. Fixes: `d28b387fb7` ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:22 -05:00
Vitaly Kuznetsov	afd67ee3cb	KVM: x86: Don't update vcpu->arch.pv_eoi.msr_val when a bogus value was written to MSR_KVM_PV_EOI_EN When kvm_gfn_to_hva_cache_init() call from kvm_lapic_set_pv_eoi() fails, MSR write to MSR_KVM_PV_EOI_EN results in #GP so it is reasonable to expect that the value we keep internally in KVM wasn't updated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:22 -05:00
Vitaly Kuznetsov	77c3323f48	KVM: x86: Rename kvm_lapic_enable_pv_eoi() kvm_lapic_enable_pv_eoi() is a misnomer as the function is also used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:22 -05:00
Paul Durrant	760849b147	KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES Currently when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introduces a new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves and function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Message-Id: <20211105095101.5384-3-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:21 -05:00
Sean Christopherson	8b44b174f6	KVM: x86: Add helper to consolidate core logic of SET_CPUID{2} flows Move the core logic of SET_CPUID and SET_CPUID2 to a common helper, the only difference between the two ioctls() is the format of the userspace struct. A future fix will add yet more code to the core logic. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105095101.5384-2-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:21 -05:00
Junaid Shahid	10c30de019	kvm: mmu: Use fast PF path for access tracking of huge pages when possible The fast page fault path bails out on write faults to huge pages in order to accommodate dirty logging. This change adds a check to do that only when dirty logging is actually enabled, so that access tracking for huge pages can still use the fast path for write faults in the common case. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211104003359.2201967-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:20 -05:00
Sean Christopherson	c435d4b7ba	KVM: x86/mmu: Properly dereference rcu-protected TDP MMU sptep iterator Wrap the read of iter->sptep in tdp_mmu_map_handle_target_level() with rcu_dereference(). Shadow pages in the TDP MMU, and thus their SPTEs, are protected by rcu. This fixes a Sparse warning at tdp_mmu.c:900:51: warning: incorrect type in argument 1 (different address spaces) expected unsigned long long [usertype] sptep got unsigned long long [noderef] [usertype] __rcu [usertype] sptep Fixes: `7158bee4b4` ("KVM: MMU: pass kvm_mmu_page struct to make_spte") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211103161833.3769487-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:20 -05:00
Maxim Levitsky	cae72dcc3b	KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using standard kvm's inject_pending_event, and not via APICv/AVIC. Since this is a debug feature, just inhibit APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU. Fixes: `61e5f69ef0` ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:20 -05:00
Jim Mattson	e6cd31f1a8	kvm: x86: Convert return type of is_valid_rdpmc_ecx() to bool These function names sound like predicates, and they have siblings, is_valid_msr(), which _are_ predicates. Moreover, there are comments that essentially warn that these functions behave unexpectedly. Flip the polarity of the return values, so that they become predicates, and convert the boolean result to a success/failure code at the outer call site. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105202058.1048757-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:19 -05:00
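The polarity flip described in this commit can be illustrated as follows. This is a hedged sketch rather than the kernel code: `NUM_COUNTERS`, `is_valid_counter_index()` and `check_counter_index()` are hypothetical, and the convention that a nonzero return means failure is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_COUNTERS 4u  /* hypothetical number of counters */

/* A real predicate: true means "valid", as the is_valid_ prefix suggests. */
static bool is_valid_counter_index(uint32_t idx)
{
    return idx < NUM_COUNTERS;
}

/*
 * Outer call site: convert the boolean to a success/failure code here,
 * and only here. In this sketch 0 means success, nonzero means failure.
 */
static int check_counter_index(uint32_t idx)
{
    return is_valid_counter_index(idx) ? 0 : 1;
}
```

Keeping the conversion at the outermost caller means the function name and its return value agree everywhere else.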
David Woodhouse	7e2175ebd6	KVM: x86: Fix recording of guest steal time / preempted status In commit `b043138246` ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems. Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set. Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the wrong PFN after the userspace page tables change. In the !IOMEM case at least the stale page is pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty. Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial. 
Cc: stable@vger.kernel.org Fixes: `b043138246` ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org> [I didn't entirely agree with David's assessment of the usefulness of the gfn_to_pfn cache, and integrated the outcome of the discussion in the above commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:56:19 -05:00
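The reason an atomic exchange is wanted on the preempted field can be shown with C11 atomics in user space. This is only an analogy: the kernel performs the exchange on a userspace address with exception handling, which this sketch omits. The struct, the flag values and the function names below are hypothetical stand-ins.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PREEMPTED 0x01u  /* hypothetical flag: vCPU was preempted */
#define FLUSH_TLB 0x02u  /* hypothetical flag: a TLB flush was requested */

struct steal_time_model {
    _Atomic uint8_t preempted;
};

/*
 * Fetch the old flags and clear them in one atomic step, so a flush
 * request set concurrently by another CPU cannot be lost between a
 * separate read and a separate write.
 */
static uint8_t fetch_and_clear_preempted(struct steal_time_model *st)
{
    return atomic_exchange(&st->preempted, 0);
}

/* Decide from the fetched value whether a TLB flush is owed. */
static int needs_tlb_flush(uint8_t old)
{
    return (old & FLUSH_TLB) != 0;
}
```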
Peter Gonda	0b020f5af0	KVM: SEV: Add support for SEV-ES intra host migration For SEV-ES to work with intra host migration, the VMSAs, GHCB metadata, and other SEV-ES info need to be preserved along with the guest's memory. Signed-off-by: Peter Gonda <pgonda@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-4-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:35:27 -05:00
Peter Gonda	b56639318b	KVM: SEV: Add support for SEV intra host migration For SEV to work with intra host migration, contents of the SEV info struct such as the ASID (used to index the encryption key in the AMD SP) and the list of memory regions need to be transferred to the target VM. This change adds a command for a target VMM to get a source SEV VM's sev info. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-3-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:35:27 -05:00
Paolo Bonzini	91b692a03c	KVM: SEV: provide helpers to charge/uncharge misc_cg Avoid code duplication across all callers of misc_cg_try_charge and misc_cg_uncharge. The resource type for KVM is always derived from sev->es_active, and the quantity is always 1. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:35:26 -05:00
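The helper pair this commit describes derives the resource type from one flag and always charges a quantity of 1, so callers no longer repeat either detail. The sketch below models that idea in user space; the cgroup accounting is a toy counter, and every name in it (`res_type`, `cgroup_model`, `sev_model`, `sev_charge()`, `sev_uncharge()`) is hypothetical rather than the kernel's.

```c
#include <stdbool.h>

enum res_type { RES_SEV, RES_SEV_ES, RES_TYPES };

/* Toy stand-in for a cgroup with per-resource usage and limits. */
struct cgroup_model {
    unsigned long usage[RES_TYPES];
    unsigned long limit[RES_TYPES];
};

struct sev_model {
    bool es_active;
    struct cgroup_model *cg;
};

/* The resource type always follows from es_active. */
static enum res_type sev_res_type(const struct sev_model *sev)
{
    return sev->es_active ? RES_SEV_ES : RES_SEV;
}

/* Charge exactly one unit; fail if the limit would be exceeded. */
static int sev_charge(struct sev_model *sev)
{
    enum res_type t = sev_res_type(sev);

    if (sev->cg->usage[t] + 1 > sev->cg->limit[t])
        return -1;
    sev->cg->usage[t] += 1;
    return 0;
}

/* Release the one unit taken by sev_charge(). */
static void sev_uncharge(struct sev_model *sev)
{
    sev->cg->usage[sev_res_type(sev)] -= 1;
}
```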
Paolo Bonzini	f4d3165370	KVM: generalize "bugged" VM to "dead" VM Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases where it is by design that the VM cannot be operated upon. In this case any KVM_BUG_ON should still warn, so introduce a new flag kvm->vm_dead that is separate from kvm->vm_bugged. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:35:26 -05:00
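The split between the two flags can be modeled in a few lines. The field names `vm_bugged` and `vm_dead` come from the commit message; everything else (`vm_model`, the helper names, the warning counter) is a hypothetical user-space sketch of the intended behavior, not the kernel implementation.

```c
#include <stdbool.h>

struct vm_model {
    bool vm_bugged;  /* a bug was detected and already reported */
    bool vm_dead;    /* the VM must not be operated upon */
    int warnings;    /* number of warnings emitted (stand-in for WARN) */
};

/* By-design shutdown: the VM becomes unusable, but nothing is reported. */
static void vm_mark_dead(struct vm_model *vm)
{
    vm->vm_dead = true;
}

/* A bugged VM is also dead. */
static void vm_mark_bugged(struct vm_model *vm)
{
    vm->vm_bugged = true;
    vm_mark_dead(vm);
}

/*
 * Bug check: warn once per VM. Because the check looks at vm_bugged and
 * not vm_dead, a VM that is merely dead still produces the warning.
 */
static bool vm_bug_on(struct vm_model *vm, bool cond)
{
    if (!cond)
        return false;
    if (!vm->vm_bugged) {
        vm->warnings++;
        vm_mark_bugged(vm);
    }
    return true;
}

static bool vm_usable(const struct vm_model *vm)
{
    return !vm->vm_dead;
}
```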
Peter Gonda	b67a4cc35c	KVM: SEV: Refactor out sev_es_state struct Move SEV-ES vCPU metadata into new sev_es_state struct from vcpu_svm. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-2-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-11 10:35:26 -05:00
Linus Torvalds	d7e0a795bf	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "ARM:
   - More progress on the protected VM front, now with the full fixed feature set as well as the limitation of some hypercalls after initialisation.
   - Cleanup of the RAZ/WI sysreg handling, which was pointlessly complicated
   - Fixes for the vgic placement in the IPA space, together with a bunch of selftests
   - More memcg accounting of the memory allocated on behalf of a guest
   - Timer and vgic selftests
   - Workarounds for the Apple M1 broken vgic implementation
   - KConfig cleanups
   - New kvmarm.mode=none option, for those who really dislike us
  RISC-V:
   - New KVM port.
  x86:
   - New API to control TSC offset from userspace
   - TSC scaling for nested hypervisors on SVM
   - Switch masterclock protection from raw_spin_lock to seqcount
   - Clean up function prototypes in the page fault code and avoid repeated memslot lookups
   - Convey the exit reason to userspace on emulation failure
   - Configure time between NX page recovery iterations
   - Expose Predictive Store Forwarding Disable CPUID leaf
   - Allocate page tracking data structures lazily (if the i915 KVM-GT functionality is not compiled in)
   - Cleanups, fixes and optimizations for the shadow MMU code
  s390:
   - SIGP Fixes
   - initial preparations for lazy destroy of secure VMs
   - storage key improvements/fixes
   - Log the guest CPNC
  Starting from this release, KVM-PPC patches will come from Michael Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  RISC-V: KVM: fix boolreturn.cocci warnings
  RISC-V: KVM: remove unneeded semicolon
  RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
  RISC-V: KVM: Factor-out FP virtualization into separate sources
  KVM: s390: add debug statement for diag 318 CPNC data
  KVM: s390: pv: properly handle page flags for protected guests
  KVM: s390: Fix handle_sske page fault handling
  KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
  KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
  KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
  KVM: x86: Clarify the kvm_run.emulation_failure structure layout
  KVM: s390: Add a routine for setting userspace CPU state
  KVM: s390: Simplify SIGP Set Arch handling
  KVM: s390: pv: avoid stalls when making pages secure
  KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
  KVM: s390: pv: avoid double free of sida page
  KVM: s390: pv: add macros for UVC CC values
  s390/mm: optimize reset_guest_reference_bit()
  s390/mm: optimize set_guest_storage_key()
  s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
  ...	2021-11-02 11:24:14 -07:00
Linus Torvalds	6e5772c8d9	Merge tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic confidential computing updates from Borislav Petkov:
 "Add an interface called cc_platform_has() which is supposed to be used by confidential computing solutions to query different aspects of the system.
  The intent behind it is to unify testing of such aspects instead of having each confidential computing solution add its own set of tests to code paths in the kernel, leading to an unwieldy mess"
* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
  x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
  x86/sev: Replace occurrences of sev_active() with cc_platform_has()
  x86/sme: Replace occurrences of sme_active() with cc_platform_has()
  powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
  x86/sev: Add an x86 version of cc_platform_has()
  arch/cc: Introduce a function to check for confidential computing features
  x86/ioremap: Selectively build arch override encryption functions	2021-11-01 15:16:52 -07:00
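The idea behind a single query interface can be sketched as one function that answers attribute questions from platform state, in place of scattered per-vendor checks such as the sme_active() and sev_active() calls listed above. The model below is a user-space illustration only: `cc_attr_model`, `platform_state` and `cc_has()` are hypothetical names, and the mapping from state to attributes is an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical attribute set that callers can query. */
enum cc_attr_model {
    ATTR_MEM_ENCRYPT,         /* any memory encryption is active */
    ATTR_HOST_MEM_ENCRYPT,    /* host-side memory encryption is active */
    ATTR_GUEST_MEM_ENCRYPT,   /* guest memory encryption is active */
    ATTR_GUEST_STATE_ENCRYPT, /* guest register state is encrypted */
};

/* Hypothetical platform state, e.g. as detected at boot. */
struct platform_state {
    bool sme;
    bool sev;
    bool sev_es;
};

/* One entry point; callers ask about a capability, not about a vendor. */
static bool cc_has(const struct platform_state *p, enum cc_attr_model attr)
{
    switch (attr) {
    case ATTR_MEM_ENCRYPT:
        return p->sme || p->sev;
    case ATTR_HOST_MEM_ENCRYPT:
        return p->sme && !p->sev;
    case ATTR_GUEST_MEM_ENCRYPT:
        return p->sev;
    case ATTR_GUEST_STATE_ENCRYPT:
        return p->sev_es;
    }
    return false;
}
```

A second architecture would supply its own implementation of the same function, which is what the separate x86 and powerpc versions in the commit list suggest.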
Linus Torvalds	8cb1ae19bf	Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
 - Cleanup of extable fixup handling to be more robust, which in turn allows the FPU exception fixups to be made more robust as well.
 - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success, as that is all the calling code evaluates.
 - A large refactoring of the FPU code to prepare for adding AMX support:
    - Disentangle the public header maze and, in particular, remove the misnamed kitchen sink internal.h which is, despite its name, included all over the place.
    - Add a proper abstraction for the register buffer storage (struct fpstate) which allows the buffer to be sized dynamically at runtime by flipping the pointer to the buffer container from the default container, which is embedded in task_struct::thread::fpu, to a dynamically allocated container with a larger register buffer.
    - Convert the code over to the new fpstate mechanism.
    - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core, which reduces the number of exports and avoids adding even more exports when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessarily different and incomplete in the KVM copy.
    - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support, because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate, AMX support can be added to the core code without affecting KVM.
    - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place.
 - Add AMX (Advanced Matrix eXtensions) support (finally):
   AMX is a large XSTATE component which is going to be available with Sapphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows trapping the (first) use of an AMX related instruction, which has two benefits:
    1) It allows the kernel to control access to the feature
    2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the extra 8K or larger state storage.
   It would have been great to gain this kind of control already with AVX512.
   The support comes with the following infrastructure components:
    1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature
       Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc.
    2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant, which has been broken since AVX512 support was added.
    3) Exception handling for #NM traps to catch first use of an extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature had been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated.
       In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils, and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either.
       When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently.
    4) Enumeration and size calculations
    5) Trap switching via MSR_XFD
       The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the task's XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from an AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable.
       All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditionally from the fpstate properties.
    6) Enable the new AMX states
   Note, this is relatively new code despite the fact that AMX support has been in the works for more than a year now.
   The big refactoring of the FPU code, which allowed a proper integration, was started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which was not caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up, and any as yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words...
   Many thanks to Chang Bae and Dave Hansen for working hard on this, and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely, which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  Documentation/x86: Add documentation for using dynamic XSTATE features
  x86/fpu: Include vmalloc.h for vzalloc()
  selftests/x86/amx: Add context switch test
  selftests/x86/amx: Add test cases for AMX state management
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu: Add XFD handling for dynamic states
  x86/fpu: Calculate the default sizes independently
  x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
  x86/fpu/xstate: Add fpstate_realloc()/free()
  x86/fpu/xstate: Add XFD #NM handler
  x86/fpu: Update XFD state where required
  x86/fpu: Add sanity checks for XFD
  x86/fpu: Add XFD state to fpstate
  x86/msr-index: Add MSRs for XFD
  x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
  x86/fpu: Reset permission and fpstate on exec()
  x86/fpu: Prepare fpu_clone() for dynamically enabled features
  x86/fpu/signal: Prepare for variable sigframe length
  x86/signal: Use fpu::__state_user_size for sigalt stack validation
  ...	2021-11-01 14:03:56 -07:00
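The #NM handling and the XFD context switch rule described in this pull request can be summarized as a small decision model. This is a user-space sketch under stated assumptions, not the kernel implementation: `task_model`, `cpu_model`, `handle_nm()`, `switch_xfd()` and the `FEATURE_AMX_DATA` bit are hypothetical, and signals are represented by enum values.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_AMX_DATA (1ull << 18)  /* hypothetical feature bit */

enum nm_outcome {
    NM_NOT_XFD,  /* trap was not caused by a dynamically enabled feature */
    NM_SIGILL,   /* permission was never granted */
    NM_SIGSEGV,  /* permission granted, but buffer allocation failed */
    NM_ENABLED,  /* larger buffer installed, trap disarmed */
};

struct task_model {
    uint64_t permitted;  /* features the process asked for and was granted */
    uint64_t xfd;        /* features whose first use still traps */
};

/* Model of the first-use trap: cause holds the trapping feature bits. */
static enum nm_outcome handle_nm(struct task_model *t, uint64_t cause, bool alloc_ok)
{
    if (!cause)
        return NM_NOT_XFD;
    if ((cause & t->permitted) != cause)
        return NM_SIGILL;
    if (!alloc_ok)
        return NM_SIGSEGV;
    t->xfd &= ~cause;  /* disarm the trap for this task permanently */
    return NM_ENABLED;
}

struct cpu_model {
    uint64_t xfd_shadow;  /* last value written to the (modeled) MSR */
    unsigned msr_writes;  /* how many writes were actually issued */
};

/* Context switch: write the MSR only if the value really changes. */
static void switch_xfd(struct cpu_model *c, const struct task_model *next)
{
    if (c->xfd_shadow != next->xfd) {
        c->xfd_shadow = next->xfd;
        c->msr_writes++;
    }
}
```

The shadow comparison is what keeps the cost low when consecutive tasks have the same XFD value; a write is only issued when switching between tasks whose values differ.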
Linus Torvalds	43aa0a195f	Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Thomas Gleixner:
 - Improve retpoline code patching by separating it from alternatives, which reduces memory footprint and allows better optimizations in the actual runtime patching.
 - Add proper retpoline support for x86/BPF
 - Address noinstr warnings in x86/kvm, lockdep and paravirtualization code
 - Add support to handle pv_ops indirect calls in the noinstr analysis
 - Classify symbols upfront and cache the result to avoid redundant strcmp() invocations.
 - Add a CFI hash to reduce memory consumption, which also reduces runtime on an allyesconfig by ~50%
 - Adjust XEN code to make objtool handling more robust and, as a side effect, to prevent text fragmentation due to placement of the hypercall page.
* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  bpf,x86: Respect X86_FEATURE_RETPOLINE*
  bpf,x86: Simplify computing label offsets
  x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
  x86/alternative: Add debug prints to apply_retpolines()
  x86/alternative: Try inline spectre_v2=retpoline,amd
  x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
  x86/alternative: Implement .retpoline_sites support
  x86/retpoline: Create a retpoline thunk array
  x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
  x86/asm: Fixup odd GEN-for-each-reg.h usage
  x86/asm: Fix register order
  x86/retpoline: Remove unused replacement symbols
  objtool,x86: Replace alternatives with .retpoline_sites
  objtool: Shrink struct instruction
  objtool: Explicitly avoid self modifying code in .altinstr_replacement
  objtool: Classify symbols
  objtool: Support pv_ops indirect calls for noinstr
  x86/xen: Rework the xen_{cpu,irq,mmu}_ops arrays
  x86/xen: Mark xen_force_evtchn_callback() noinstr
  x86/xen: Make irq_disable() noinstr
  ...	2021-11-01 13:24:43 -07:00
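The "classify symbols upfront and cache the result" item amounts to doing the string comparison once per symbol at load time and storing the answer as a flag. The sketch below illustrates that pattern only; `symbol_model` and the function names are hypothetical, and the use of the `__x86_indirect_thunk_` prefix as the classification rule is an assumption for the example rather than objtool's actual logic.

```c
#include <stdbool.h>
#include <string.h>

struct symbol_model {
    const char *name;
    bool retpoline_thunk;  /* cached classification result */
};

/* Run the string comparison exactly once per symbol. */
static void classify_symbol(struct symbol_model *s)
{
    static const char prefix[] = "__x86_indirect_thunk_";

    s->retpoline_thunk = strncmp(s->name, prefix, sizeof(prefix) - 1) == 0;
}

/* Classify the whole table upfront, e.g. right after reading symbols. */
static void classify_all(struct symbol_model *syms, size_t n)
{
    for (size_t i = 0; i < n; i++)
        classify_symbol(&syms[i]);
}

/* Later queries are a flag test, with no string comparison at all. */
static bool is_retpoline_thunk(const struct symbol_model *s)
{
    return s->retpoline_thunk;
}
```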
Linus Torvalds	ca5e83eddc	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: - Fixes for s390 interrupt delivery - Fixes for Xen emulator bugs showing up as debug kernel WARNs - Fix another issue with SEV/ES string I/O VMGEXITs * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Take srcu lock in post_kvm_run_save() KVM: SEV-ES: fix another issue with string I/O VMGEXITs KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block() KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock KVM: s390: preserve deliverable_mask in __airqs_kick_single_vcpu KVM: s390: clear kicked_mask before sleeping again	2021-10-31 11:19:02 -07:00
Paolo Bonzini	4e33868433	Merge tag 'kvmarm-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 5.16 - More progress on the protected VM front, now with the full fixed feature set as well as the limitation of some hypercalls after initialisation. - Cleanup of the RAZ/WI sysreg handling, which was pointlessly complicated - Fixes for the vgic placement in the IPA space, together with a bunch of selftests - More memcg accounting of the memory allocated on behalf of a guest - Timer and vgic selftests - Workarounds for the Apple M1 broken vgic implementation - KConfig cleanups - New kvmarm.mode=none option, for those who really dislike us	2021-10-31 02:28:48 -04:00
David Woodhouse	f3d1436d4b	KVM: x86: Take srcu lock in post_kvm_run_save() The Xen interrupt injection for event channels relies on accessing the guest's vcpu_info structure in __kvm_xen_has_interrupt(), through a gfn_to_hva_cache. This requires the srcu lock to be held, which is mostly the case except for this code path: [ 11.822877] WARNING: suspicious RCU usage [ 11.822965] ----------------------------- [ 11.823013] include/linux/kvm_host.h:664 suspicious rcu_dereference_check() usage! [ 11.823131] [ 11.823131] other info that might help us debug this: [ 11.823131] [ 11.823196] [ 11.823196] rcu_scheduler_active = 2, debug_locks = 1 [ 11.823253] 1 lock held by dom:0/90: [ 11.823292] #0: ffff998956ec8118 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x680 [ 11.823379] [ 11.823379] stack backtrace: [ 11.823428] CPU: 2 PID: 90 Comm: dom:0 Kdump: loaded Not tainted 5.4.34+ #5 [ 11.823496] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.823612] Call Trace: [ 11.823645] dump_stack+0x7a/0xa5 [ 11.823681] lockdep_rcu_suspicious+0xc5/0x100 [ 11.823726] __kvm_xen_has_interrupt+0x179/0x190 [ 11.823773] kvm_cpu_has_extint+0x6d/0x90 [ 11.823813] kvm_cpu_accept_dm_intr+0xd/0x40 [ 11.823853] kvm_vcpu_ready_for_interrupt_injection+0x20/0x30 < post_kvm_run_save() inlined here > [ 11.823906] kvm_arch_vcpu_ioctl_run+0x135/0x6a0 [ 11.823947] kvm_vcpu_ioctl+0x263/0x680 Fixes: `40da8ccd72` ("KVM: x86/xen: Add event channel interrupt vector upcall") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Message-Id: <606aaaf29fca3850a63aa4499826104e77a72346.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-28 10:45:38 -04:00
Paolo Bonzini	9b0971ca7f	KVM: SEV-ES: fix another issue with string I/O VMGEXITs If the guest requests string I/O from the hypervisor via VMGEXIT, SW_EXITINFO2 will contain the REP count. However, sev_es_string_io was incorrectly treating it as the size of the GHCB buffer in bytes. This fixes the "outsw" test in the experimental SEV tests of kvm-unit-tests. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reported-by: Marc Orr <marcorr@google.com> Tested-by: Marc Orr <marcorr@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-27 10:58:26 -04:00
David Woodhouse	0985dba842	KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block() In kvm_vcpu_block, the current task is set to TASK_INTERRUPTIBLE before making a final check whether the vCPU should be woken from HLT by any incoming interrupt. This is a problem for the get_user() in __kvm_xen_has_interrupt(), which really shouldn't be sleeping when the task state has already been set. I think it's actually harmless as it would just manifest itself as a spurious wakeup, but it's causing a debug warning: [ 230.963649] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b6bcdbc9>] prepare_to_swait_exclusive+0x30/0x80 Fix the warning by turning it into an explicit spurious wakeup. When invoked with !task_is_running(current) (and we might as well add in_atomic() there while we're at it), just return 1 to indicate that an IRQ is pending, which will cause a wakeup and then something will call it again in a context that can sleep so it can fault the page back in. Cc: stable@vger.kernel.org Fixes: `40da8ccd72` ("KVM: x86/xen: Add event channel interrupt vector upcall") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <168bf8c689561da904e48e2ff5ae4713eaef9e2d.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 09:10:18 -04:00
David Woodhouse	8228c77d8b	KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock On the preemption path when updating a Xen guest's runstate times, this lock is taken inside the scheduler rq->lock, which is a raw spinlock. This was shown in a lockdep warning: [ 89.138354] ============================= [ 89.138356] [ BUG: Invalid wait context ] [ 89.138358] 5.15.0-rc5+ #834 Tainted: G S I E [ 89.138360] ----------------------------- [ 89.138361] xen_shinfo_test/2575 is trying to lock: [ 89.138363] ffffa34a0364efd8 (&kvm->arch.pvclock_gtod_sync_lock){....}-{3:3}, at: get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138442] other info that might help us debug this: [ 89.138444] context-{5:5} [ 89.138445] 4 locks held by xen_shinfo_test/2575: [ 89.138447] #0: ffff972bdc3b8108 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x6f0 [kvm] [ 89.138483] #1: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_ioctl_run+0xdc/0x8b0 [kvm] [ 89.138526] #2: ffff97331fdbac98 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xff/0xbd0 [ 89.138534] #3: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x26/0x170 [kvm] ... [ 89.138695] get_kvmclock_ns+0x1f/0x130 [kvm] [ 89.138734] kvm_xen_update_runstate+0x14/0x90 [kvm] [ 89.138783] kvm_xen_update_runstate_guest+0x15/0xd0 [kvm] [ 89.138830] kvm_arch_vcpu_put+0xe6/0x170 [kvm] [ 89.138870] kvm_sched_out+0x2f/0x40 [kvm] [ 89.138900] __schedule+0x5de/0xbd0 Cc: stable@vger.kernel.org Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com Fixes: `30b5c851af` ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <1b02a06421c17993df337493a68ba923f3bd5c0f.camel@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 08:14:38 -04:00
David Edmondson	0d7d84498f	KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol When passing the failing address and size out to user space, SGX must ensure not to trample on the earlier fields of the emulation_failure sub-union of struct kvm_run. Signed-off-by: David Edmondson <david.edmondson@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210920103737.2696756-5-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 06:48:25 -04:00
David Edmondson	e615e35589	KVM: x86: On emulation failure, convey the exit reason, etc. to userspace Should instruction emulation fail, include the VM exit reason, etc. in the emulation_failure data passed to userspace, in order that the VMM can report it as a debugging aid when describing the failure. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210920103737.2696756-4-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 06:48:24 -04:00
David Edmondson	0a62a0319a	KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info Extend the get_exit_info static call to provide the reason for the VM exit. Modify relevant trace points to use this rather than extracting the reason in the caller. Signed-off-by: David Edmondson <david.edmondson@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210920103737.2696756-3-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 06:48:24 -04:00
Thomas Gleixner	d69c1382e1	x86/kvm: Convert FPU handling to a single swap buffer For the upcoming AMX support it's necessary to do a proper integration with KVM. Currently KVM allocates two FPU structs which are used for saving the user state of the vCPU thread and restoring the guest state when entering vcpu_run() and doing the reverse operation before leaving vcpu_run(). With the new fpstate mechanism this can be reduced to one extra buffer by swapping the fpstate pointer in current::thread::fpu. This makes the upcoming support for AMX and XFD simpler because then fpstate information (features, sizes, xfd) are always consistent and it does not require any nasty workarounds. Convert the KVM FPU code over to this new scheme. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211022185313.019454292@linutronix.de	2021-10-23 16:13:29 +02:00
Linus Torvalds	cd82c4a73b	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more x86 kvm fixes from Paolo Bonzini: - Cache coherency fix for SEV live migration - Fix for instruction emulation with PKU - fixes for rare delaying of interrupt delivery - fix for SEV-ES buffer overflow * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed KVM: SEV-ES: keep INS functions together KVM: x86: remove unnecessary arguments from complete_emulator_pio_in KVM: x86: split the two parts of emulator_pio_in KVM: SEV-ES: clean up kvm_sev_es_ins/outs KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out KVM: SEV-ES: rename guest_ins_data to sev_pio_data KVM: SEV: Flush cache on non-coherent systems before RECEIVE_UPDATE_DATA KVM: MMU: Reset mmu->pkru_mask to avoid stale data KVM: nVMX: promptly process interrupts delivered while in guest mode KVM: x86: check for interrupts before deciding whether to exit the fast path	2021-10-22 09:02:15 -10:00
Jim Mattson	dfd3c713a9	kvm: x86: Remove stale declaration of kvm_no_apic_vcpu This variable was renamed to kvm_has_noapic_vcpu in commit `6e4e3b4df4` ("KVM: Stop using deprecated jump label APIs"). Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20211021185449.3471763-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:46:37 -04:00
Sean Christopherson	ec5a4919fa	KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup Unregister KVM's posted interrupt wakeup handler during unsetup so that a spurious interrupt that arrives after kvm_intel.ko is unloaded doesn't call into freed memory. Fixes: `bf9f6ac8d7` ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009001107.3936588-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:46:37 -04:00
Sean Christopherson	187c8833de	KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism Use a rw_semaphore instead of a mutex to coordinate APICv updates so that vCPUs responding to requests can take the lock for read and run in parallel. Using a mutex forces serialization of vCPUs even though kvm_vcpu_update_apicv() only touches data local to that vCPU or is protected by a different lock, e.g. SVM's ir_list_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Sean Christopherson	ee49a89329	KVM: x86: Move SVM's APICv sanity check to common x86 Move SVM's assertion that vCPU's APICv state is consistent with its VM's state out of svm_vcpu_run() and into x86's common inner run loop. The assertion and underlying logic is not unique to SVM, it's just that SVM has more inhibiting conditions and thus is more likely to run headfirst into any KVM bugs. Add relevant comments to document exactly why the update path has unusual ordering between the update and the kick, why said ordering is safe, and also the basic rules behind the assertion in the run loop. Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Paolo Bonzini	95e16b4792	KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed The PIO scratch buffer is larger than a single page, and therefore it is not possible to copy it in a single step to vcpu->arch/pio_data. Bound each call to emulator_pio_in/out to a single page; keep track of how many I/O operations are left in vcpu->arch.sev_pio_count, so that the operation can be restarted in the complete_userspace_io callback. For OUT, this means that the previous kvm_sev_es_outs implementation becomes an iterator of the loop, and we can consume the sev_pio_data buffer before leaving to userspace. For IN, instead, consuming the buffer and decreasing sev_pio_count is always done in the complete_userspace_io callback, because that is when the memcpy is done into sev_pio_data. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reported-by: Felix Wilhelm <fwilhelm@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:09:13 -04:00
Paolo Bonzini	4fa4b38dae	KVM: SEV-ES: keep INS functions together Make the diff a little nicer when we actually get to fixing the bug. No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:51 -04:00
Paolo Bonzini	6b5efc930b	KVM: x86: remove unnecessary arguments from complete_emulator_pio_in complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in, and therefore does not need the size and count arguments. This makes things nicer when the function is called directly from a complete_userspace_io callback. No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:38 -04:00
Paolo Bonzini	3b27de2718	KVM: x86: split the two parts of emulator_pio_in emulator_pio_in handles both the case where the data is pending in vcpu->arch.pio.count, and the case where I/O has to be done via either an in-kernel device or a userspace exit. For SEV-ES we would like to split these, to identify clearly the moment at which the sev_pio_data is consumed. To this end, create two different functions: __emulator_pio_in fills in vcpu->arch.pio.count, while complete_emulator_pio_in clears it and releases vcpu->arch.pio.data. Because this patch has to be backported, things are left a bit messy. kernel_pio() operates on vcpu->arch.pio, which leads to emulator_pio_in() having two calls to complete_emulator_pio_in(). It will be fixed in the next release. While at it, remove the unused void* val argument of emulator_pio_in_out. The function currently hardcodes vcpu->arch.pio_data as the source/destination buffer, which sucks but will be fixed after the more severe SEV-ES buffer overflow. No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:08:00 -04:00
Paolo Bonzini	ea724ea420	KVM: SEV-ES: clean up kvm_sev_es_ins/outs A few very small cleanups to the functions, smushed together because the patch is already very small like this: - inline emulator_pio_in_emulated and emulator_pio_out_emulated, since we already have the vCPU - remove the data argument and pull setting vcpu->arch.sev_pio_data into the caller - remove unnecessary clearing of vcpu->arch.pio.count when emulation is done by the kernel (and therefore vcpu->arch.pio.count is already clear on exit from emulator_pio_in and emulator_pio_out). No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:02:20 -04:00
Paolo Bonzini	0d33b1baeb	KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out Currently emulator_pio_in clears vcpu->arch.pio.count twice if emulator_pio_in_out performs kernel PIO. Move the clear into emulator_pio_out where it is actually necessary. No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:02:07 -04:00
Paolo Bonzini	b5998402e3	KVM: SEV-ES: rename guest_ins_data to sev_pio_data We will be using this field for OUTS emulation as well, in case the data that is pushed via OUTS spans more than one page. In that case, there will be a need to save the data pointer across exits to userspace. So, change the name to something that refers to any kind of PIO. Also spell out what it is used for, namely SEV-ES. No functional change intended. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 10:01:26 -04:00
Sean Christopherson	21fa324654	KVM: x86/mmu: Extract zapping of rmaps for gfn range to separate helper Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a separate helper to clean up the unholy mess that kvm_zap_gfn_range() has become. In addition to deep nesting, the rmaps zapping spreads out the declaration of several variables and is generally a mess. Clean up the mess now so that future work to improve the memslots implementation doesn't need to deal with it. Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:52 -04:00
Sean Christopherson	e8be2a5ba8	KVM: x86/mmu: Drop a redundant remote TLB flush in kvm_zap_gfn_range() Remove an unnecessary remote TLB flush in kvm_zap_gfn_range() now that said function holds mmu_lock for write for its entire duration. The flush was added by the now-reverted commit to allow TDP MMU to flush while holding mmu_lock for read, as the transition from write=>read required dropping the lock and thus a pending flush needed to be serviced. Fixes: `5a324c24b6` ("Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:46 -04:00
Sean Christopherson	bc3b3c1002	KVM: x86/mmu: Drop a redundant, broken remote TLB flush A recent commit to fix the calls to kvm_flush_remote_tlbs_with_address() in kvm_zap_gfn_range() inadvertently added yet another flush instead of fixing the existing flush. Drop the redundant flush, and fix the params for the existing flush. Cc: stable@vger.kernel.org Fixes: `2822da4466` ("KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:30 -04:00
Lai Jiangshan	61b05a9fd4	KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() kvm_mmu_unload() destroys all the PGD caches. Use the lighter kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:44:43 -04:00
Lai Jiangshan	264d3dc1d3	KVM: X86: pair smp_wmb() of mmu_try_to_unsync_pages() with smp_rmb() The commit `578e1c4db2` ("kvm: x86: Avoid taking MMU lock in kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire() isn't used on the load of SPTE.W. smp_load_acquire() orders _subsequent_ loads after sp->is_unsync; it does not order _earlier_ loads before the load of sp->is_unsync. This has no functional change; smp_rmb() is a NOP on x86, and no compiler barrier is required because there is a VMEXIT between the load of SPTE.W and kvm_mmu_sync_roots. Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:43:21 -04:00
Lai Jiangshan	509bfe3d97	KVM: X86: Cache CR3 in prev_roots when PCID is disabled The commit `21823fbda5` ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific PCID and in the case of PCID is disabled, it includes all PGDs in the prev_roots and the commit made prev_roots totally unused in this case. Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1 before the said commit: (CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest RIP is global; cr3_b is cached in prev_roots) modify page tables under cr3_b the shadow root of cr3_b is unsync in kvm INVPCID single context the guest expects the TLB is clean for PCID=0 change CR4.PCIDE 0 -> 1 switch to cr3_b with PCID=0,NOFLUSH=1 No sync in kvm, cr3_b is still unsync in kvm jump to the page that was modified in step 1 shadow page tables point to the wrong page It is a very unlikely case, but it shows that stale prev_roots can be a problem after CR4.PCIDE changes from 0 to 1. However, to fix this case, the commit disabled caching CR3 in prev_roots altogether when PCID is disabled. Not all CPUs have PCID; especially the PCID support for AMD CPUs is kind of recent. To restore the prev_roots optimization for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when CR4.PCIDE changes. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	e45e9e3998	KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() The KVM doesn't know whether any TLB for a specific pcid is cached in the CPU when tdp is enabled. So it is better to flush all the guest TLB when invalidating any single PCID context. The case is very rare or even impossible since KVM generally doesn't intercept CR3 write or INVPCID instructions when tdp is enabled, so the fix is mostly for the sake of overall robustness. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	a91a7c7096	KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS while kvm_mmu_role doesn't use X86_CR4_PGE. So X86_CR4_PGE is also removed from KVM_MMU_CR4_ROLE_BITS. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Lai Jiangshan	552617382c	KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0 X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Wanpeng Li	9ae7f6c9b5	KVM: emulate: Comment on difference between RDPMC implementation and manual SDM mentioned that, RDPMC: IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter)) THEN EAX := counter[31:0]; EDX := ZeroExtend(counter[MSCB:32]); ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *) #GP(0); FI; Let's add a comment why CR0.PE isn't tested since it's impossible for CPL to be >0 if CR0.PE=0. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634724836-73721-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Sean Christopherson	9dadfc4a61	KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages Paul pointed out the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel and kvm_amd, and use the name for relevant error message when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
Junaid Shahid	4dfe4f40d8	kvm: x86: mmu: Make NX huge page recovery period configurable Currently, the NX huge page recovery thread wakes up every minute and zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX huge pages at a time. This is intended to ensure that only a relatively small number of pages get zapped at a time. But for very large VMs (or more specifically, VMs with a large number of executable pages), a period of 1 minute could still result in this number being too high (unless the ratio is changed significantly, but that can result in split pages lingering on for too long). This change makes the period configurable instead of fixing it at 1 minute. Users of large VMs can then adjust the period and/or the ratio to reduce the number of pages zapped at one time while still maintaining the same overall duration for cycling through the entire list. By default, KVM derives a period from the ratio such that a page will remain on the list for 1 hour on average. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20211020010627.305925-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
Wanpeng Li	540c7abe61	KVM: vPMU: Fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 SDM section 18.2.3 mentioned that: "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of any general-purpose or fixed-function counters via a single WRMSR." It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing, the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop global_ovf_ctrl variable. Tested-by: Like Xu <likexu@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
David Matlack	610265ea3d	KVM: x86/mmu: Rename slot_handle_leaf to slot_handle_level_4k slot_handle_leaf is a misnomer because it only operates on 4K SPTEs whereas "leaf" is used to describe any valid terminal SPTE (4K or large page). Rename slot_handle_leaf to slot_handle_level_4k to avoid confusion. Making this change makes it more obvious there is a benign discrepancy between the legacy MMU and the TDP MMU when it comes to dirty logging. The legacy MMU only iterates through 4K SPTEs when zapping for collapsing and when clearing D-bits. The TDP MMU, on the other hand, iterates through SPTEs on all levels. The TDP MMU behavior of zapping SPTEs at all levels is technically overkill for its current dirty logging implementation, which always demotes to 4k SPTES, but both the TDP MMU and legacy MMU zap if and only if the SPTE can be replaced by a larger page, i.e. will not spuriously zap 2m (or larger) SPTEs. Opportunistically add comments to explain this discrepancy in the code. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20211019162223.3935109-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	e099f3eb0e	KVM: VMX: RTIT_CTL_BRANCH_EN has no dependency on other CPUID bit Per Intel SDM, RTIT_CTL_BRANCH_EN bit has no dependency on any CPUID leaf 0x14. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-5-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	f4d3a902a5	KVM: VMX: Rename pt_desc.addr_range to pt_desc.num_address_ranges To make the field's meaning self-explanatory and match the PT_CAP_num_address_ranges constant. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	ba51d62723	KVM: VMX: Use precomputed vmx->pt_desc.addr_range The number of valid PT ADDR MSRs for the guest is precomputed in vmx->pt_desc.addr_range. Use it instead of calculating again. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-3-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
Xiaoyao Li	2e6e0d683b	KVM: VMX: Restore host's MSR_IA32_RTIT_CTL when it's not zero A minor optimization to WRMSR MSR_IA32_RTIT_CTL when necessary. Opportunistically refine the comment to call out that KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-2-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
Paolo Bonzini	2839180ce5	KVM: x86/mmu: clean up prefetch/prefault/speculative naming "prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
David Stevens	1e76a3ce0d	KVM: cleanup allocation of rmaps and page tracking data Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:25 -04:00
Masahiro Kozuka	c8c340a9b4	KVM: SEV: Flush cache on non-coherent systems before RECEIVE_UPDATE_DATA Flush the destination page before invoking RECEIVE_UPDATE_DATA, as the PSP encrypts the data with the guest's key when writing to guest memory. If the target memory was not previously encrypted, the cache may contain dirty, unencrypted data that will persist on non-coherent systems. Fixes: `15fb7de1a7` ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Masahiro Kozuka <masa.koz@kozuka.jp> [sean: converted bug report to changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210914210951.2994260-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-21 13:01:25 -04:00
Chenyi Qiang	a3ca5281bb	KVM: MMU: Reset mmu->pkru_mask to avoid stale data When updating mmu->pkru_mask, the value can only be added but it isn't reset in advance. This will make mmu->pkru_mask keep the stale data. Fix this issue. Fixes: `2d344105f5` ("KVM, pkeys: introduce pkru_mask to cache conditions") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20211021071022.1140-1-chenyi.qiang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-21 11:09:29 -04:00
Paolo Bonzini	3a25dfa67f	KVM: nVMX: promptly process interrupts delivered while in guest mode Since commit `c300ab9f08` ("KVM: x86: Replace late check_nested_events() hack with more precise fix") there is no longer the certainty that check_nested_events() tries to inject an external interrupt vmexit to L1 on every call to vcpu_enter_guest. Therefore, even in that case we need to set KVM_REQ_EVENT. This ensures that inject_pending_event() is called, and from there kvm_check_nested_events(). Fixes: `c300ab9f08` ("KVM: x86: Replace late check_nested_events() hack with more precise fix") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-21 03:35:42 -04:00
Paolo Bonzini	de7cd3f676	KVM: x86: check for interrupts before deciding whether to exit the fast path The kvm_x86_sync_pir_to_irr callback can sometimes set KVM_REQ_EVENT. If that happens exactly at the time that an exit is handled as EXIT_FASTPATH_REENTER_GUEST, vcpu_enter_guest will go incorrectly through the loop that calls kvm_x86_run, instead of processing the request promptly. Fixes: `379a3c8ee4` ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-21 03:35:41 -04:00
Thomas Gleixner	1c57572d75	x86/KVM: Convert to fpstate Convert KVM code to the new register storage mechanism in preparation for dynamically sized buffers. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211013145322.451439983@linutronix.de	2021-10-20 22:34:14 +02:00
Thomas Gleixner	087df48c29	x86/fpu: Replace KVMs xstate component clearing In order to prepare for the support of dynamically enabled FPU features, move the clearing of xstate components to the FPU core code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211013145322.399567049@linutronix.de	2021-10-20 22:26:41 +02:00
Thomas Gleixner	bf5d004707	x86/fpu: Replace KVMs home brewed FPU copy to user Similar to the copy from user function the FPU core has this already implemented with all bells and whistles. Get rid of the duplicated code and use the core functionality. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.244101845@linutronix.de	2021-10-20 22:17:17 +02:00
Linus Torvalds	0afe64bebb	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Tools: - kvm_stat: do not show halt_wait_ns since it is not a cumulative statistic x86: - clean ups and fixes for bus lock vmexit and lazy allocation of rmaps - two fixes for SEV-ES (one more coming as soon as I get reviews) - fix for static_key underflow ARM: - Properly refcount pages used as a concatenated stage-2 PGD - Fix missing unlock when detecting the use of MTE+VM_SHARED" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SEV-ES: reduce ghcb_sa_len to 32 bits KVM: VMX: Remove redundant handling of bus lock vmexit KVM: kvm_stat: do not show halt_wait_ns KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET" KVM: SEV-ES: Set guest_state_protected after VMSA update KVM: X86: fix lazy allocation of rmaps KVM: SEV-ES: fix length of string I/O KVM: arm64: Release mmap_lock when using VM_SHARED with MTE KVM: arm64: Report corrupted refcount at EL2 KVM: arm64: Fix host stage-2 PGD refcount KVM: s390: Function documentation fixes	2021-10-20 05:52:10 -10:00
Thomas Gleixner	b56d2795b2	x86/fpu: Replace the includes of fpu/internal.h Now that the file is empty, fixup all references with the proper includes and delete the former kitchen sink. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de	2021-10-20 15:27:29 +02:00
Thomas Gleixner	d9d005f32a	x86/fpu: Move mxcsr related code to core No need to expose that to code which only needs the XCR0 accessors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011539.740012411@linutronix.de	2021-10-20 15:27:28 +02:00
Thomas Gleixner	ea4d6938d4	x86/fpu: Replace KVMs home brewed FPU copy from user Copying a user space buffer to the memory buffer is already available in the FPU core. The copy mechanism in KVM lacks sanity checks and needs to use cpuid() to lookup the offset of each component, while the FPU core has this information cached. Make the FPU core variant accessible for KVM and replace the home brewed mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.134065207@linutronix.de	2021-10-20 15:27:27 +02:00
Thomas Gleixner	a0ff0611c2	x86/fpu: Move KVMs FPU swapping to FPU core Swapping the host/guest FPU is directly fiddling with FPU internals which requires 5 exports. The upcoming support of dynamically enabled states would even need more. Implement a swap function in the FPU core code and export that instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de	2021-10-20 15:27:27 +02:00
Thomas Gleixner	126fe04018	x86/fpu: Cleanup xstate xcomp_bv initialization No point in having this duplicated all over the place with needlessly different defines. Provide a proper initialization function which initializes user buffers properly and make KVM use it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011538.897664678@linutronix.de	2021-10-20 15:27:26 +02:00
Oliver Upton	828ca89628	KVM: x86: Expose TSC offset controls to userspace To date, VMM-directed TSC synchronization and migration has been a bit messy. KVM has some baked-in heuristics around TSC writes to infer if the VMM is attempting to synchronize. This is problematic, as it depends on host userspace writing to the guest's TSC within 1 second of the last write. A much cleaner approach to configuring the guest's views of the TSC is to simply migrate the TSC offset for every vCPU. Offsets are idempotent, and thus not subject to change depending on when the VMM actually reads/writes values from/to KVM. The VMM can then read the TSC once with KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when the guest is paused. Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-8-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Oliver Upton	58d4277be9	KVM: x86: Refactor tsc synchronization code Refactor kvm_synchronize_tsc to make a new function that allows callers to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly for the sake of participating in TSC synchronization. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-7-oupton@google.com> [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by Maxim Levitsky. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Paolo Bonzini	869b44211a	kvm: x86: protect masterclock with a seqcount Protect the reference point for kvmclock with a seqcount, so that kvmclock updates for all vCPUs can proceed in parallel. Xen runstate updates will also run in parallel and not bounce the kvmclock cacheline. Of the variables that were protected by pvclock_gtod_sync_lock, nr_vcpus_matched_tsc is different because it is updated outside pvclock_update_vm_gtod_copy and read inside it. Therefore, we need to keep it protected by a spinlock. In fact it must now be a raw spinlock, because pvclock_update_vm_gtod_copy, being the write-side of a seqcount, is non-preemptible. Since we already have tsc_write_lock which is a raw spinlock, we can just use tsc_write_lock as the lock that protects the write-side of the seqcount. Co-developed-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-6-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Oliver Upton	c68dc1b577	KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Handling the migration of TSCs correctly is difficult, in part because Linux does not provide userspace with the ability to retrieve a (TSC, realtime) clock pair for a single instant in time. In lieu of a more convenient facility, KVM can report similar information in the kvm_clock structure. Provide userspace with a host TSC & realtime pair iff the realtime clock is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid realtime value, advance the KVM clock by the amount of elapsed time. Do not step the KVM clock backwards, though, as it is a monotonic oscillator. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Paolo Bonzini	3d5e7a28b1	KVM: x86: avoid warning with -Wbitwise-instead-of-logical This is a new warning in clang top-of-tree (will be clang 14): In file included from arch/x86/kvm/mmu/mmu.c:27: arch/x86/kvm/mmu/spte.h:318:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] return __is_bad_mt_xwr(rsvd_check, spte) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || arch/x86/kvm/mmu/spte.h:318:9: note: cast one or both operands to int to silence this warning The code is fine, but change it anyway to shut up this clever clogs of a compiler. Reported-by: torvic9@mailbox.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:42 -04:00
Paolo Bonzini	a25c78d04c	Merge commit 'kvm-pagedata-alloc-fixes' into HEAD	2021-10-18 14:13:37 -04:00
Paolo Bonzini	9f1ee7b169	KVM: SEV-ES: reduce ghcb_sa_len to 32 bits The size of the GHCB scratch area is limited to 16 KiB (GHCB_SCRATCH_AREA_LIMIT), so there is no need for it to be a u64. This fixes a build error on 32-bit systems: i686-linux-gnu-ld: arch/x86/kvm/svm/sev.o: in function `sev_es_string_io': sev.c:(.text+0x110f): undefined reference to `__udivdi3' Cc: stable@vger.kernel.org Fixes: `019057bd73` ("KVM: SEV-ES: fix length of string I/O") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:19 -04:00
Hao Xiang	d61863c66f	KVM: VMX: Remove redundant handling of bus lock vmexit Hardware may or may not set exit_reason.bus_lock_detected on BUS_LOCK VM-Exits. Dealing with KVM_RUN_X86_BUS_LOCK in handle_bus_lock_vmexit could be redundant when exit_reason.basic is EXIT_REASON_BUS_LOCK. We can remove redundant handling of bus lock vmexit. Unconditionally Set exit_reason.bus_lock_detected in handle_bus_lock_vmexit(), and deal with KVM_RUN_X86_BUS_LOCK only in vmx_handle_exit(). Signed-off-by: Hao Xiang <hao.xiang@linux.alibaba.com> Message-Id: <1634299161-30101-1-git-send-email-hao.xiang@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:19 -04:00
Sean Christopherson	9139a7a645	KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload WARN if the static keys used to track if any vCPU has disabled its APIC are left elevated at module exit. Unlike the underflow case, nothing in the static key infrastructure will complain if a key is left elevated, and because an elevated key only affects performance, nothing in KVM will fail if either key is improperly incremented. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211013003554.47705-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:18 -04:00
Sean Christopherson	f7d8a19f9a	Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET" Revert a change to open code bits of kvm_lapic_set_base() when emulating APIC RESET to fix an apic_hw_disabled underflow bug due to arch.apic_base and apic_hw_disabled being unsynchronized when the APIC is created. If kvm_arch_vcpu_create() fails after creating the APIC, kvm_free_lapic() will see the initialized-to-zero vcpu->arch.apic_base and decrement apic_hw_disabled without KVM ever having incremented apic_hw_disabled. Using kvm_lapic_set_base() in kvm_lapic_reset() is also desirable for a potential future where KVM supports RESET outside of vCPU creation, in which case all the side effects of kvm_lapic_set_base() are needed, e.g. to handle the transition from x2APIC => xAPIC. Alternatively, KVM could temporarily increment apic_hw_disabled (and call kvm_lapic_set_base() at RESET), but that's a waste of cycles and would impact the performance of other vCPUs and VMs. The other subtle side effect is that updating the xAPIC ID needs to be done at RESET regardless of whether the APIC was previously enabled, i.e. kvm_lapic_reset() needs an explicit call to kvm_apic_set_xapic_id() regardless of whether or not kvm_lapic_set_base() also performs the update. That makes stuffing the enable bit at vCPU creation slightly more palatable, as doing so affects only the apic_hw_disabled key. Opportunistically tweak the comment to explicitly call out the connection between vcpu->arch.apic_base and apic_hw_disabled, and add a comment to call out the need to always do kvm_apic_set_xapic_id() at RESET. Underflow scenario: kvm_vm_ioctl() { kvm_vm_ioctl_create_vcpu() { kvm_arch_vcpu_create() { if (something_went_wrong) goto fail_free_lapic; /* vcpu->arch.apic_base is initialized when something_went_wrong is false. */ kvm_vcpu_reset() { kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) { vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; } } return 0; fail_free_lapic: kvm_free_lapic() { /* vcpu->arch.apic_base is not yet initialized when something_went_wrong is true. */ if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE)) static_branch_slow_dec_deferred(&apic_hw_disabled); // <= underflow bug. } return r; } } } This (mostly) reverts commit `421221234a`. Fixes: `421221234a` ("KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET") Reported-by: syzbot+9fc046ab2b0cf295a063@syzkaller.appspotmail.com Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211013003554.47705-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:18 -04:00
Paolo Bonzini	fa13843d15	KVM: X86: fix lazy allocation of rmaps If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation must be skipped lest KVM will overwrite the previous pointer and will indeed leak memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:17 -04:00
Peter Gonda	baa1e5ca17	KVM: SEV-ES: Set guest_state_protected after VMSA update The refactoring in commit `bb18a67774` ("KVM: SEV: Acquire vcpu mutex when updating VMSA") left behind the assignment to svm->vcpu.arch.guest_state_protected; add it back. Signed-off-by: Peter Gonda <pgonda@google.com> [Delta between v2 and v3 of Peter's patch, which had already been committed; the commit message is my own. - Paolo] Fixes: `bb18a67774` ("KVM: SEV: Acquire vcpu mutex when updating VMSA") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:17 -04:00
Andrei Vagin	a7cc099f2e	KVM: x86/mmu: kvm_faultin_pfn has to return false if pfn is returned This looks like a typo in `8f32d5e563`. This change didn't intend to do any functional changes. The problem was caught by gVisor tests. Fixes: `8f32d5e563` ("KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Message-Id: <20211015163221.472508-1-avagin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 08:19:18 -04:00
Paolo Bonzini	019057bd73	KVM: SEV-ES: fix length of string I/O The size of the data in the scratch buffer is not divided by the size of each port I/O operation, so vcpu->arch.pio.count ends up being larger than it should be by a factor of size. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-15 04:47:36 -04:00
Peter Zijlstra	b08cadbd3b	Merge branch 'objtool/urgent' Fixup conflicts. # Conflicts: # tools/objtool/check.c	2021-10-07 00:40:17 +02:00
Tom Lendacky	4d96f91091	x86/sev: Replace occurrences of sev_active() with cc_platform_has() Replace uses of sev_active() with the more generic cc_platform_has() using CC_ATTR_GUEST_MEM_ENCRYPT. If future support is added for other memory encryption technologies, the use of CC_ATTR_GUEST_MEM_ENCRYPT can be updated, as required. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210928191009.32551-7-bp@alien8.de	2021-10-04 11:46:58 +02:00
Linus Torvalds	291073a566	kvm: fix objtool relocation warning The recent change to make objtool aware of more symbol relocation types (commit `24ff652573`: "objtool: Teach get_alt_entry() about more relocation types") also added another check, and resulted in this objtool warning when building kvm on x86: arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception The reason seems to be that kvm_fastop_exception() is marked as a global symbol, which causes the relocation to be kept around for objtool. And at the same time, the kvm_fastop_exception definition (which is done as an inline asm statement) doesn't actually set the type of the global, which then makes objtool unhappy. The minimal fix is to just not mark kvm_fastop_exception as being a global symbol. It's only used in that one compilation unit anyway, so it was always pointless. That's how all the other local exception table labels are done. I'm not entirely happy about the kinds of games that the kvm code plays with doing its own exception handling, and the fact that it confused objtool is most definitely a symptom of the code being a bit too subtle and ad-hoc. But at least this trivial one-liner makes objtool no longer upset about what is going on. Fixes: `24ff652573` ("objtool: Teach get_alt_entry() about more relocation types") Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/ Cc: Borislav Petkov <bp@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-10-03 13:34:19 -07:00
David Stevens	deae4a10f1	KVM: x86: only allocate gfn_track when necessary Avoid allocating the gfn_track arrays if nothing needs them. If there are no external to KVM users of the API (i.e. no GVT-g), then page tracking is only needed for shadow page tables. This means that when tdp is enabled and there are no external users, then the gfn_track arrays can be lazily allocated when the shadow MMU is actually used. This avoids allocations equal to .05% of guest memory when nested virtualization is not used, if the kernel is compiled without GVT-g. Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-3-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:58 -04:00
David Stevens	e9d0c0c4f7	KVM: x86: add config for non-kvm users of page tracking Add a config option that allows kvm to determine whether or not there are any external users of page tracking. Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-2-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:57 -04:00
Krish Sadhukhan	174a921b69	nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB According to section "TLB Flush" in APM vol 2, "Support for TLB_CONTROL commands other than the first two, is optional and is indicated by CPUID Fn8000_000A_EDX[FlushByAsid]. All encodings of TLB_CONTROL not defined in the APM are reserved." Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Message-Id: <20210920235134.101970-3-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:57 -04:00
Juergen Gross	78b497f2e6	kvm: use kvfree() in kvm_arch_free_vm() By switching from kfree() to kvfree() in kvm_arch_free_vm() Arm64 can use the common variant. This can be accomplished by adding another macro __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now. Further simplification can be achieved by adding __kvm_arch_free_vm() doing the common part. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-Id: <20210903130808.30142-5-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:57 -04:00
Babu Moger	b73a54321a	KVM: x86: Expose Predictive Store Forwarding Disable Predictive Store Forwarding: AMD Zen3 processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting address dependencies between loads and stores. How PSF works: It is very common for a CPU to execute a load instruction to an address that was recently written by a store. Modern CPUs implement a technique known as Store-To-Load-Forwarding (STLF) to improve performance in such cases. With STLF, data from the store is forwarded directly to the load without having to wait for it to be written to memory. In a typical CPU, STLF occurs after the address of both the load and store are calculated and determined to match. PSF expands on this by speculating on the relationship between loads and stores without waiting for the address calculation to complete. With PSF, the CPU learns over time the relationship between loads and stores. If STLF typically occurs between a particular store and load, the CPU will remember this. In typical code, PSF provides a performance benefit by speculating on the load result and allowing later instructions to begin execution sooner than they otherwise would be able to. The details of security analysis of AMD predictive store forwarding is documented here. https://www.amd.com/system/files/documents/security-analysis-predictive-store-forwarding.pdf Predictive Store Forwarding controls: There are two hardware control bits which influence the PSF feature: - MSR 48h bit 2 – Speculative Store Bypass (SSBD) - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD) The PSF feature is disabled if either of these bits are set. These bits are controllable on a per-thread basis in an SMT system. By default, both SSBD and PSFD are 0 meaning that the speculation features are enabled. 
While the SSBD bit disables PSF and speculative store bypass, PSFD only disables PSF. PSFD may be desirable for software which is concerned with the speculative behavior of PSF but desires a smaller performance impact than setting SSBD. Support for PSFD is indicated in CPUID Fn8000_0008 EBX[28]. All processors that support PSF will also support PSFD. Linux kernel does not have the interface to enable/disable PSFD yet. Plan here is to expose the PSFD technology to KVM so that the guest kernel can make use of it if they wish to. Signed-off-by: Babu Moger <Babu.Moger@amd.com> Message-Id: <163244601049.30292.5855870305350227855.stgit@bmoger-ubuntu> [Keep feature private to KVM, as requested by Borislav Petkov. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:57 -04:00
David Matlack	53597858db	KVM: x86/mmu: Avoid memslot lookup in make_spte and mmu_try_to_unsync_pages mmu_try_to_unsync_pages checks if page tracking is active for the given gfn, which requires knowing the memslot. We can pass down the memslot via make_spte to avoid this lookup. The memslot is also handy for make_spte's marking of the gfn as dirty: we can test whether dirty page tracking is enabled, and if so ensure that pages are mapped as writable with 4K granularity. Apart from the warning, no functional change is intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:56 -04:00
David Matlack	8a9f566ae4	KVM: x86/mmu: Avoid memslot lookup in rmap_add Avoid the memslot lookup in rmap_add, by passing it down from the fault handling code to mmu_set_spte and then to rmap_add. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:56 -04:00
Paolo Bonzini	a12f43818b	KVM: MMU: pass struct kvm_page_fault to mmu_set_spte mmu_set_spte is called for either PTE prefetching or page faults. The three boolean arguments write_fault, speculative and host_writable are always respectively false/true/true for prefetching and coming from a struct kvm_page_fault for page faults. Let mmu_set_spte distinguish these two situations by accepting a possibly NULL struct kvm_page_fault argument. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:56 -04:00
Paolo Bonzini	7158bee4b4	KVM: MMU: pass kvm_mmu_page struct to make_spte The level and A/D bit support of the new SPTE can be found in the role, which is stored in the kvm_mmu_page struct. This merges two arguments into one. For the TDP MMU, the kvm_mmu_page was not used (kvm_tdp_mmu_map does not use it if the SPTE is already present) so we fetch it just before calling make_spte. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:55 -04:00
Paolo Bonzini	87e888eafd	KVM: MMU: set ad_disabled in TDP MMU role Prepare for removing the ad_disabled argument of make_spte; instead it can be found in the role of a struct kvm_mmu_page. First of all, the TDP MMU must set the role accurately. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:55 -04:00
Paolo Bonzini	eb5cd7ffe1	KVM: MMU: remove unnecessary argument to mmu_set_spte The level of the new SPTE can be found in the kvm_mmu_page struct; there is no need to pass it down. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:55 -04:00
Paolo Bonzini	ad67e4806e	KVM: MMU: clean up make_spte return value Now that make_spte is called directly by the shadow MMU (rather than wrapped by set_spte), it only has to return one boolean value. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:54 -04:00
Paolo Bonzini	4758d47e0d	KVM: MMU: inline set_spte in FNAME(sync_page) Since the two callers of set_spte do different things with the results, inlining it actually makes the code simpler to reason about. For example, FNAME(sync_page) already has a struct kvm_mmu_page *, but set_spte had to fish it back out of sptep's private page data. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:54 -04:00
Paolo Bonzini	d786c7783b	KVM: MMU: inline set_spte in mmu_set_spte Since the two callers of set_spte do different things with the results, inlining it actually makes the code simpler to reason about. For example, mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the similarity is hidden by set_spte. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:54 -04:00
David Matlack	888104138c	KVM: x86/mmu: Avoid memslot lookup in page_fault_handle_page_track Now that kvm_page_fault has a pointer to the memslot it can be passed down to the page tracking code to avoid a redundant slot lookup. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:53 -04:00
David Matlack	e710c5f6be	KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault The memslot for the faulting gfn is used throughout the page fault handling code, so capture it in kvm_page_fault as soon as we know the gfn and use it in the page fault handling code that has direct access to the kvm_page_fault struct. Replace various tests using is_noslot_pfn with more direct tests on fault->slot being NULL. This, in combination with the subsequent patch, improves "Populate memory time" in dirty_log_perf_test by 5% when using the legacy MMU. There is no discernible improvement to the performance of the TDP MMU. No functional change intended. Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:53 -04:00
Paolo Bonzini	6ccf443882	KVM: MMU: unify tdp_mmu_map_set_spte_atomic and tdp_mmu_set_spte_atomic_no_dirty_log tdp_mmu_map_set_spte_atomic is not taking care of dirty logging anymore, the only difference that remains is that it takes a vCPU instead of the struct kvm. Merge the two functions. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:53 -04:00
Paolo Bonzini	bcc4f2bc50	KVM: MMU: mark page dirty in make_spte This simplifies set_spte, which we want to remove, and unifies code between the shadow MMU and the TDP MMU. The warning will be added back later to make_spte as well. There is a small disadvantage in the TDP MMU; it may unnecessarily mark a page as dirty twice if two vCPUs end up mapping the same page twice. However, this is a very small cost for a case that is already rare. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:53 -04:00
David Matlack	68be1306ca	KVM: x86/mmu: Fold rmap_recycle into rmap_add Consolidate rmap_recycle and rmap_add into a single function since they are only ever called together (and only from one place). This has a nice side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In addition it makes mmu_set_spte(), which is a very long function, a little shorter. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Sean Christopherson	b1a429fb18	KVM: x86/mmu: Verify shadow walk doesn't terminate early in page faults WARN and bail if the shadow walk for faulting in a SPTE terminates early, i.e. doesn't reach the expected level because the walk encountered a terminal SPTE. The shadow walks for page faults are subtle in that they install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop body, and consume the newly created non-leaf SPTE in the loop control, e.g. __shadow_walk_next(). In other words, the walks guarantee that the walk will stop if and only if the target level is reached by installing non-leaf SPTEs to guarantee the walk remains valid. Opportunistically use fault->goal_level instead of it.level in FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at the target level. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Paolo Bonzini	f0066d94c9	KVM: MMU: change tracepoints arguments to kvm_page_fault Pass struct kvm_page_fault to tracepoints instead of extracting the arguments from the struct. This also lets the kvm_mmu_spte_requested tracepoint pick the gfn directly from fault->gfn, instead of using the address. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Paolo Bonzini	536f0e6ace	KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of extracting the arguments from the struct. Tweak a bit the conditions to avoid long lines. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	73a3c65947	KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of extracting the arguments from the struct; the results are also stored in the struct, so the callers are adjusted consequently. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	3c8ad5a675	KVM: MMU: change fast_page_fault() arguments to kvm_page_fault Pass struct kvm_page_fault to fast_page_fault() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	cdc47767a0	KVM: MMU: change tdp_mmu_map_handle_target_level() arguments to kvm_page_fault Pass struct kvm_page_fault to tdp_mmu_map_handle_target_level() instead of extracting the arguments from the struct. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	2f6305dd56	KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault Pass struct kvm_page_fault to kvm_tdp_mmu_map() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	9c03b1821a	KVM: MMU: change FNAME(fetch)() arguments to kvm_page_fault Pass struct kvm_page_fault to FNAME(fetch)() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	43b74355ef	KVM: MMU: change __direct_map() arguments to kvm_page_fault Pass struct kvm_page_fault to __direct_map() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	3a13f4fea3	KVM: MMU: change handle_abnormal_pfn() arguments to kvm_page_fault Pass struct kvm_page_fault to handle_abnormal_pfn() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00
Paolo Bonzini	3647cd04b7	KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to outputs of kvm_faultin_pfn(). For now they have to be extracted again from struct kvm_page_fault in the subsequent steps, but this is temporary until other functions in the chain are switched over as well. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00

8040 Commits